© This material is licensed under a Creative Commons CC BY-NC-ND license, which allows works to be downloaded and shared with others as long as they are attributed, but not modified in any way or used commercially.
Price research within the housing market stands as a pivotal compass guiding numerous stakeholders – buyers, sellers, investors and policymakers. The dynamics of property prices encapsulate a multitude of intricate factors, dictating not just market trends but also reflecting the societal, economic and geographical characteristics of a region. In this vein, the Ames Housing Dataset assumes significance as a comprehensive repository, offering a nuanced lens into the complex tapestry of housing attributes and sale prices in Ames, Iowa.
This study analyses a subset of the original dataset. It is important to underscore that while the chosen variables offer a glimpse into critical housing attributes, their selection was driven primarily by practical constraints of time and feasibility. The chosen variables predominantly orbit around the intrinsic aspects of properties – encompassing dimensions such as property quality, condition, size measurements and temporal data. This focus allows us to unravel the essence of housing valuations through an inspection of individual property features, steering the analytical spotlight away from broader neighborhood influences.
In pursuit of this understanding, this study harnesses a blend of univariate and multivariate analytical techniques within the R programming environment. The numerical analyses are accompanied by graphical visualizations.
The initial phase of the analysis involves a univariate examination of the chosen variables. This step encompasses the scrutiny of missing data, classical numeric descriptive analysis, outlier detection and the assessment of normality assumptions. This step-by-step dissection offers a foundational understanding of the individual behaviors and distributions of these critical housing attributes.
Subsequently, the study transitions into a multivariate analysis, delving deeper into the complex interplay of these chosen variables. Employing correlation studies, the aim is to unveil intricate relationships and potential interdependencies among these property-specific attributes and the ultimate sale prices. Dimensionality reduction techniques, such as Principal Component Analysis (PCA), will be employed to distill the essence of these variables, potentially uncovering latent factors that collectively contribute to housing valuations. Additionally, there is a contemplation of employing clustering algorithms to identify possible groupings or patterns within the dataset.
The core objective of this research is to deeply understand the nuanced relationships between property attributes and sale prices. By prioritizing this understanding, the study aims to reveal the most significant variables – those attributes that exhibit the strongest association with, or influence on, housing sale prices within the local Ames real estate market. We also aim to construct classification methods to predict housing sale prices.
Loading/installation of the R packages necessary for this practice.
The following source code module loads all the packages that will be used in this R session, provided they are already installed.
#########################################
# Loading necessary packages and reason #
#########################################
# Package required to call 'ggplot' function (graphical tools)
library(ggplot2)
# Package required to call 'ggarrange' function (graphical tools)
library(ggpubr)
# Package required to call 'scatterplot3d' function
library(scatterplot3d)
# Package required to call 'melt' function
library(reshape2)
# Package required to call 'mvn' function
library(MVN)
# Package required to call 'boxM' function
library(biotools)
# Package required to call 'partimat' function
library(klaR)
# Package required to call 'summarise' function
library(dplyr)
# Package required to call 'createDataPartition' function
library(caret)
# Package required to call 'freq' and 'descr' functions (descriptive statistics)
library(summarytools)
# Package required to call 'read.spss' function (loading SPSS '.sav' data format)
library(foreign)
# Package required to call 'read_xlsx' function (loading '.xlsx' data format)
library(readxl)
# Package required to load the data set 'RBGlass1'
library(archdata)
# Package required to call 'cortest.bartlett' function
library(psych)
# Package required to call 'fviz_pca_var', 'fviz_pca_ind', 'fviz_pca',
# 'get_dist', 'fviz_cluster' and 'fviz_dist' functions
library(factoextra)
# Package required to call 'hetcor' function
library(polycor)
# Package required to call 'ggcorrplot' function
library(ggcorrplot)
# Package required to call 'corrplot' function
library(corrplot)
# Package required to call 'rplot' function
library(corrr)
# Package required to plot ROC curves
library(pROC)
# Package required to call 'mutate' function
library(tidyverse)
# Package required to call 'clusGap' function
library(cluster)
# Package required to call 'ggdendrogram' function
library(ggdendro)
# Package required to call 'grid.arrange' function
library(gridExtra)
# Package required for model-based clustering ('Mclust' function)
library(mclust)
# Package required to call 'skewness' and 'kurtosis' functions
library(moments)
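The block above assumes every package is already installed. A minimal sketch of a loader that installs any missing packages first; the helper names `missing_packages` and `load_or_install` are our own, not from any package:

```r
# Hypothetical helper: report which of the requested packages are not installed
missing_packages <- function(pkgs) {
  pkgs[!pkgs %in% rownames(installed.packages())]
}

# Install anything missing, then attach everything
load_or_install <- function(pkgs) {
  miss <- missing_packages(pkgs)
  if (length(miss) > 0) install.packages(miss)
  invisible(lapply(pkgs, library, character.only = TRUE))
}

# Example with base packages (always present, so nothing is installed):
load_or_install(c("stats", "utils"))
```

In a real session the full vector of package names from the chunk above would be passed in a single call.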
In this chunk we will load the dataset.
# Loading the .csv file
# The output of this function is already a data.frame object
data <- read.csv("train.csv", header = TRUE, sep = ",", fileEncoding = "UTF-8")
In this phase we carry out a preliminary exploratory analysis of the data. To do this, we apply different numerical and graphical techniques used in class. At first, we will focus on the analysis of each variable independently without yet looking for possible interactions between them, in order to detect: data groupings, missing values, classical numeric descriptive analysis, extreme values, assumption of normality.
# attach() makes the columns of this data.frame accessible by name,
# without referring to the data.frame identifier.
# Caution: attach() registers a copy of the data.frame, so later
# modifications to 'data' are not reflected in the attached variables.
attach(data)
# We will only select 17 variables for our study
data <- data[, c(
"MSSubClass",
"MSZoning",
"LotFrontage",
"LotArea",
"OverallQual",
"OverallCond",
"YearBuilt",
"ExterQual",
"ExterCond",
"GarageQual",
"GrLivArea",
"X1stFlrSF",
"X2ndFlrSF",
"YrSold",
"SaleType",
"SaleCondition",
"SalePrice"
)]
# Retrieving the name of the selected variables
colnames(data)
## [1] "MSSubClass" "MSZoning" "LotFrontage" "LotArea"
## [5] "OverallQual" "OverallCond" "YearBuilt" "ExterQual"
## [9] "ExterCond" "GarageQual" "GrLivArea" "X1stFlrSF"
## [13] "X2ndFlrSF" "YrSold" "SaleType" "SaleCondition"
## [17] "SalePrice"
# Displaying a few records
head(data)
Now we show the proportion of missing values per column.
# Proportion of NA values per column
na_counts <- colMeans(is.na(data))
print(na_counts)
## MSSubClass MSZoning LotFrontage LotArea OverallQual
## 0.00000000 0.00000000 0.17739726 0.00000000 0.00000000
## OverallCond YearBuilt ExterQual ExterCond GarageQual
## 0.00000000 0.00000000 0.00000000 0.00000000 0.05547945
## GrLivArea X1stFlrSF X2ndFlrSF YrSold SaleType
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## SaleCondition SalePrice
## 0.00000000 0.00000000
MSSubClass identifies the type of dwelling involved in the sale. (Discrete Numerical)
# Classical numeric descriptive analysis
describe(MSSubClass)
freq(MSSubClass)
## Frequencies
## MSSubClass
## Type: Integer
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
## 20 536 36.71 36.71 36.71 36.71
## 30 69 4.73 41.44 4.73 41.44
## 40 4 0.27 41.71 0.27 41.71
## 45 12 0.82 42.53 0.82 42.53
## 50 144 9.86 52.40 9.86 52.40
## 60 299 20.48 72.88 20.48 72.88
## 70 60 4.11 76.99 4.11 76.99
## 75 16 1.10 78.08 1.10 78.08
## 80 58 3.97 82.05 3.97 82.05
## 85 20 1.37 83.42 1.37 83.42
## 90 52 3.56 86.99 3.56 86.99
## 120 87 5.96 92.95 5.96 92.95
## 160 63 4.32 97.26 4.32 97.26
## 180 10 0.68 97.95 0.68 97.95
## 190 30 2.05 100.00 2.05 100.00
## <NA> 0 0.00 100.00
## Total 1460 100.00 100.00 100.00 100.00
# Histogram and density plots
p1<-ggplot(data,aes(x=MSSubClass))+geom_histogram()+
labs(title = "Histogram of MSSubClass",x="MSSubClass",y="Values")
p1
# Missing Values
na_counts <- sum(is.na(data$MSSubClass))
print(na_counts)
## [1] 0
There are no missing values.
# Boxplot to visualize outliers
p1 <- ggplot(data, aes(x = MSSubClass)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 1, outlier.size = 2) +
coord_flip() +
labs(title = "Boxplot of MSSubClass", x = "Values", y = "")
# Count outliers
outliers <- boxplot.stats(data$MSSubClass)$out
# Display the count of outliers
cat("Number of outliers:", length(outliers), "\n")
## Number of outliers: 103
# Identify rows with outliers
outlier_rows <- which(data$MSSubClass %in% outliers)
# Print the indices of rows with outliers
cat("Indices of rows with outliers:", outlier_rows, "\n")
## Indices of rows with outliers: 10 49 57 76 88 94 116 126 146 166 173 181 194 196 226 228 233 236 244 247 286 292 301 313 336 345 349 364 412 431 433 435 473 489 490 491 501 505 521 536 579 600 604 615 624 636 638 650 656 676 686 688 704 706 714 756 759 830 832 838 862 915 916 957 960 963 970 972 976 986 1008 1030 1031 1039 1040 1063 1069 1087 1089 1092 1105 1145 1161 1173 1187 1191 1192 1220 1237 1266 1267 1292 1298 1305 1335 1359 1365 1368 1379 1394 1417 1450 1453
p1
There are 103 outliers; since there are enough data records, we will eliminate them.
# Remove rows with outliers
data <- data[-outlier_rows, ]
# Print a message after removing outliers
cat("Outliers removed. New number of rows:", nrow(data), "\n")
## Outliers removed. New number of rows: 1357
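The outlier rule applied throughout this study is the one implemented by `boxplot.stats`: with the default `coef = 1.5`, a point is flagged when it lies more than 1.5 times the interquartile range beyond the hinges (which R computes via `fivenum`). A self-contained sketch on synthetic data, checking the flagged values against the rule computed by hand:

```r
set.seed(1)
x <- c(rnorm(100, mean = 50, sd = 5), 120, -30)  # two planted extremes

# Outliers according to boxplot.stats
out_bp <- boxplot.stats(x)$out

# The same rule computed manually from the hinges
fn <- fivenum(x)                 # min, lower hinge, median, upper hinge, max
iqr <- fn[4] - fn[2]
out_manual <- x[x < fn[2] - 1.5 * iqr | x > fn[4] + 1.5 * iqr]

identical(sort(out_bp), sort(out_manual))  # TRUE
```

Both planted extremes (120 and -30) fall outside the fences and are flagged.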
MSSubClass isn’t a continuous variable, so we skip the normality assessment.
MSZoning identifies the general zoning classification of the sale. (Categorical)
# Basic descriptive statistics
freq(MSZoning)
## Frequencies
## MSZoning
## Type: Character
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ------------- ------ --------- -------------- --------- --------------
## C (all) 10 0.68 0.68 0.68 0.68
## FV 65 4.45 5.14 4.45 5.14
## RH 16 1.10 6.23 1.10 6.23
## RL 1151 78.84 85.07 78.84 85.07
## RM 218 14.93 100.00 14.93 100.00
## <NA> 0 0.00 100.00
## Total 1460 100.00 100.00 100.00 100.00
# Pie chart and bar graph
p1<-ggplot(data, aes(x=factor(1), fill = MSZoning))+geom_bar()+
coord_polar("y")+labs(x="MSZoning",y="%")
p2<-ggplot(data, aes(x=factor(1), fill = MSZoning))+geom_bar()+
labs(x="MSZoning",y="%")
# This function controls the graphical output device
ggarrange(p1,p2,nrow = 1,ncol=2, common.legend = TRUE)
There aren’t any missing values, so we will continue with the next variable.
LotFrontage: linear feet of street connected to the property.
# Classical numeric descriptive analysis
describe(LotFrontage)
# Histogram and density plots
p1<-ggplot(data,aes(x=LotFrontage))+geom_density()+
labs(title = "Density of LotFrontage",x="LotFrontage",y="Values")
p2<-ggplot(data,aes(x=LotFrontage))+geom_histogram()+
labs(title = "Histogram of LotFrontage",x="LotFrontage",y="Values")
# This function controls the graphical output device
ggarrange(p1,p2, nrow=1, common.legend = FALSE)
### Missing values
# Missing Values
na_counts <- sum(is.na(data$LotFrontage))
print(na_counts)
## [1] 248
print(na_counts/nrow(data))
## [1] 0.1827561
248 missing values -> 18.28%
LotFrontage is a continuous variable, so we analyze whether its missingness pattern is random with a Student’s t-test, studying the homogeneity of the variable between the NA and non-NA groups.
Null Hypothesis (H0): there is no significant difference between the means of the two groups. Alternative Hypothesis (H1): the true difference in means is not equal to 0.
# Create a subset for non-missing values
non_missing_values <- data$LotFrontage[!is.na(data$LotFrontage)]
# Perform a t-test for continuous variables
result <- t.test(non_missing_values, data$LotFrontage, alternative = "two.sided")
print(paste("Variable: LotFrontage"))
## [1] "Variable: LotFrontage"
print(result)
##
## Welch Two Sample t-test
##
## data: non_missing_values and data$LotFrontage
## t = 0, df = 2216, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.890914 1.890914
## sample estimates:
## mean of x mean of y
## 72.62489 72.62489
t-statistic: the t-value is 0. The t-value measures the difference between the group means relative to the variability in the data; a t-value of 0 indicates no difference between the means.
Degrees of Freedom (df): the Welch Two Sample t-test has 2216 degrees of freedom.
p-value: the p-value is 1. This is the probability of observing a t-statistic at least as extreme as the one computed from the sample, assuming the null hypothesis is true; a p-value of 1 gives no evidence against the null hypothesis.
Confidence Interval: the 95 percent confidence interval for the difference in means, -1.890914 to 1.890914, contains 0, further supporting the absence of a significant difference.
Interpretation: we fail to reject the null hypothesis; there is no significant evidence that the means of the two groups (missing and non-missing values of LotFrontage) differ.
Since the groups are homogeneous, the missingness pattern is treated as random, and in this scenario we choose to replace the NAs with the mean.
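A caveat: `t.test` drops NAs internally, so comparing the non-missing values against the full column compares two identical samples, and t = 0 with p = 1 holds by construction. A more informative randomness check compares a fully observed variable between the rows where LotFrontage is missing and those where it is not (in the study itself, `SalePrice` could play that role). A hedged sketch on simulated data:

```r
set.seed(42)
n <- 500
lot_frontage <- rnorm(n, mean = 70, sd = 20)        # stand-in for LotFrontage
sale_price   <- rnorm(n, mean = 180000, sd = 40000) # stand-in for SalePrice
miss <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.18, 0.82))
lot_frontage[miss] <- NA

# The comparison performed above: non-missing values vs. the full column.
# t.test removes NAs from each sample, so both sides are the same data.
trivial <- t.test(lot_frontage[!is.na(lot_frontage)], lot_frontage)
unname(trivial$statistic)   # exactly 0 by construction

# A more informative check: compare a fully observed variable between the
# rows where lot_frontage is missing and those where it is not.
result <- t.test(sale_price[is.na(lot_frontage)],
                 sale_price[!is.na(lot_frontage)])
result$p.value              # missingness was planted independently of price
```

Here a small p-value would suggest the missingness is related to the sale price, i.e. not random.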
# Calculate the mean of non-missing values
mean_lot_frontage <- mean(data$LotFrontage, na.rm = TRUE)
# Replace missing values with the mean
data$LotFrontage[is.na(data$LotFrontage)] <- mean_lot_frontage
na_counts <- sum(is.na(data$LotFrontage))
print(na_counts)
## [1] 0
print(na_counts/nrow(data))
## [1] 0
describe(data$LotFrontage)
# Histogram and density plots
p1<-ggplot(data,aes(x=data$LotFrontage))+geom_density()+
labs(title = "Density of LotFrontage",x="LotFrontage",y="Values")
p2<-ggplot(data,aes(x=data$LotFrontage))+geom_histogram()+
labs(title = "Histogram of LotFrontage",x="LotFrontage",y="Values")
# This function controls the graphical output device
ggarrange(p1,p2, nrow=1, common.legend = FALSE)
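Mean imputation fills every gap with the same value, which preserves the mean but shrinks the variance and is sensitive to skew; the median is a common robust alternative. A sketch on synthetic data (not part of the original analysis), illustrating the variance shrinkage:

```r
set.seed(7)
v <- c(rlnorm(200, meanlog = 4), rep(NA, 40))  # right-skewed variable with NAs

v_mean   <- replace(v, is.na(v), mean(v, na.rm = TRUE))
v_median <- replace(v, is.na(v), median(v, na.rm = TRUE))

# Mean imputation leaves the mean unchanged but reduces the spread
sd(v_mean) < sd(v, na.rm = TRUE)  # TRUE
```

For the strongly skewed LotFrontage-like case, median imputation would pull the filled values toward the bulk of the distribution rather than toward the long right tail.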
# Boxplot to visualize outliers
p1 <- ggplot(data, aes(x = LotFrontage)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 1, outlier.size = 2) +
coord_flip() +
labs(title = "Boxplot of LotFrontage", x = "Values", y = "")
ggarrange(p1, nrow=1, common.legend = FALSE)
# Count outliers
outliers <- boxplot.stats(data$LotFrontage)$out
# Display the count of outliers
cat("Number of outliers:", length(outliers), "\n")
## Number of outliers: 47
# Identify rows with outliers
outlier_rows <- which(data$LotFrontage %in% outliers)
# Print the indices of rows with outliers
cat("Indices of rows with outliers:", outlier_rows, "\n")
## Indices of rows with outliers: 36 41 82 83 142 151 162 184 216 241 257 258 270 280 284 290 297 401 415 485 614 751 769 829 846 849 872 898 901 922 940 987 1026 1027 1047 1070 1086 1087 1090 1099 1125 1183 1206 1243 1245 1264 1266
There are 47 outliers; since there are enough data records, we will eliminate them.
# Remove rows with outliers
data <- data[-outlier_rows, ]
# Print a message after removing outliers
cat("Outliers removed. New number of rows:", nrow(data), "\n")
## Outliers removed. New number of rows: 1310
# Histogram representation of the column LotFrontage
par(mfcol = c(1, 1))
# Define the column name
j0 <- "LotFrontage"
# Set the sequence for the x-axis
x0 <- seq(min(data[, j0], na.rm = TRUE), max(data[, j0], na.rm = TRUE), length.out = 50)
# Create a histogram for the entire column
hist(data[, j0], prob = TRUE, col = grey(0.8), main = paste("Histogram of", j0), xlab = j0)
# Overlay normal distribution curve
lines(x0, dnorm(x0, mean(data[, j0], na.rm = TRUE), sd(data[, j0], na.rm = TRUE)), col = "blue", lwd = 2)
# Representation of normal quantiles for the column LotFrontage
par(mfrow = c(1, 1))
# Define the column name
j0 <- "LotFrontage"
# Set the sequence for the x-axis
x0 <- seq(min(data[, j0], na.rm = TRUE), max(data[, j0], na.rm = TRUE), length.out = 50)
# Create a quantile-quantile plot
qqnorm(data[, j0], main = paste("Q-Q Plot of", j0), pch = 19, col = 2)
qqline(data[, j0])
This exploratory analysis can give an idea of whether the variable is approximately normally distributed, but it is always better to run the corresponding formal normality tests.
The null hypothesis that the data follow a univariate normal distribution is tested. This hypothesis is rejected if the p-value given by the Shapiro-Wilk test is less than 0.05. Otherwise the assumption of normality of the data is not rejected.
data_tidy <- melt(data, value.name = "value")
result <- aggregate(value ~ variable, data = data_tidy,
FUN = function(x) shapiro.test(x)$p.value)
# Print the row corresponding to LotFrontage
print(subset(result, variable == "LotFrontage"))
## variable value
## 2 LotFrontage 2.510955e-12
There is evidence of lack of univariate normality (p-value < 0.05).
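The behaviour of `shapiro.test` can be checked on simulated data: a Gaussian sample is usually not rejected, while a strongly skewed sample is rejected decisively. A minimal sketch:

```r
set.seed(123)
normal_sample <- rnorm(500)  # truly Gaussian
skewed_sample <- rexp(500)   # strongly right-skewed

p_norm <- shapiro.test(normal_sample)$p.value
p_skew <- shapiro.test(skewed_sample)$p.value

p_skew < 0.05  # the exponential sample is rejected as normal
```

Note that `shapiro.test` accepts between 3 and 5000 observations, which is why the study applies it per variable on this moderately sized subset.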
LotArea: lot size in square feet.
# Classical numeric descriptive analysis
describe(LotArea)
# Histogram and density plots
p1<-ggplot(data,aes(x=LotArea))+geom_density()+
labs(title = "Density of LotArea",x="LotArea",y="Values")
p2<-ggplot(data,aes(x=LotArea))+geom_histogram()+
labs(title = "Histogram of LotArea",x="LotArea",y="Values")
# This function controls the graphical output device
ggarrange(p1,p2, nrow=1, common.legend = FALSE)
# Missing Values
na_counts <- sum(is.na(data$LotArea))
print(na_counts)
## [1] 0
print(na_counts/nrow(data))
## [1] 0
0 missing values -> 0%
# Boxplot to visualize outliers
p1 <- ggplot(data, aes(x = LotArea)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 1, outlier.size = 2) +
coord_flip() +
labs(title = "Boxplot of LotArea", x = "Values", y = "")
ggarrange(p1, nrow=1, common.legend = FALSE)
# Count outliers
outliers <- boxplot.stats(data$LotArea)$out
# Display the count of outliers
cat("Number of outliers:", length(outliers), "\n")
## Number of outliers: 59
# Identify rows with outliers
outlier_rows <- which(data$LotArea %in% outliers)
# Print the indices of rows with outliers
cat("Indices of rows with outliers:", outlier_rows, "\n")
## Indices of rows with outliers: 50 62 104 110 114 161 167 221 242 268 305 306 320 327 340 372 377 401 407 471 485 499 504 528 590 593 595 597 619 620 632 651 692 749 766 772 793 829 850 854 943 952 1061 1102 1124 1131 1139 1145 1155 1174 1207 1238 1250 1263 1276 1283 1287 1299 1304
There are 59 outliers; since there are enough data records, we will eliminate them.
# Remove rows with outliers
data <- data[-outlier_rows, ]
# Print a message after removing outliers
cat("Outliers removed. New number of rows:", nrow(data), "\n")
## Outliers removed. New number of rows: 1251
# Histogram representation of the column LotArea
par(mfcol = c(1, 1))
# Define the column name
j0 <- "LotArea"
# Set the sequence for the x-axis
x0 <- seq(min(data[, j0], na.rm = TRUE), max(data[, j0], na.rm = TRUE), length.out = 50)
# Create a histogram for the entire column
hist(data[, j0], prob = TRUE, col = grey(0.8), main = paste("Histogram of", j0), xlab = j0)
# Overlay normal distribution curve
lines(x0, dnorm(x0, mean(data[, j0], na.rm = TRUE), sd(data[, j0], na.rm = TRUE)), col = "blue", lwd = 2)
# Representation of normal quantiles for the column LotArea
par(mfrow = c(1, 1))
# Define the column name
j0 <- "LotArea"
# Set the sequence for the x-axis
x0 <- seq(min(data[, j0], na.rm = TRUE), max(data[, j0], na.rm = TRUE), length.out = 50)
# Create a quantile-quantile plot
qqnorm(data[, j0], main = paste("Q-Q Plot of", j0), pch = 19, col = 2)
qqline(data[, j0])
This exploratory analysis can give an idea of whether the variable is approximately normally distributed, but it is always better to run the corresponding formal normality tests.
The null hypothesis that the data follow a univariate normal distribution is tested. This hypothesis is rejected if the p-value given by the Shapiro-Wilk test is less than 0.05. Otherwise the assumption of normality of the data is not rejected.
data_tidy <- melt(data, value.name = "value")
# Use the aggregate function with the correct FUN argument
result <- aggregate(value ~ variable, data = data_tidy,
FUN = function(x) shapiro.test(x)$p.value)
# Print the row corresponding to LotArea
print(subset(result, variable == "LotArea"))
## variable value
## 3 LotArea 7.59414e-05
There is evidence of lack of univariate normality (p-value < 0.05).
OverallQual rates the overall material and finish of the house (1 = Very Poor, 10 = Very Excellent). (Ordinal Numerical Discrete)
# Classical numeric descriptive analysis
describe(OverallQual)
# Histogram and density plots
p1<-ggplot(data,aes(x=OverallQual))+geom_density()+
labs(title = "Density of OverallQual",x="OverallQual",y="Values")
p2<-ggplot(data,aes(x=OverallQual))+geom_histogram()+
labs(title = "Histogram of OverallQual",x="OverallQual",y="Values")
# This function controls the graphical output device
ggarrange(p1,p2, nrow=1, common.legend = FALSE)
# Missing Values
na_counts <- sum(is.na(data$OverallQual))
print(na_counts)
## [1] 0
No missing values.
# Boxplot to visualize outliers
p1 <- ggplot(data, aes(x = OverallQual)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 1, outlier.size = 2) +
coord_flip() +
labs(title = "Boxplot of OverallQual", x = "Values", y = "")
ggarrange(p1, nrow=1, common.legend = FALSE)
# Count outliers
outliers <- boxplot.stats(data$OverallQual)$out
# Display the count of outliers
cat("Number of outliers:", length(outliers), "\n")
## Number of outliers: 2
# Identify rows with outliers
outlier_rows <- which(data$OverallQual %in% outliers)
# Print the indices of rows with outliers
cat("Indices of rows with outliers:", outlier_rows, "\n")
## Indices of rows with outliers: 317 455
There are 2 outliers; since there are enough data records, we will eliminate them.
# Remove rows with outliers
data <- data[-outlier_rows, ]
# Print a message after removing outliers
cat("Outliers removed. New number of rows:", nrow(data), "\n")
## Outliers removed. New number of rows: 1249
OverallQual isn’t a continuous variable, so we skip the normality assessment.
OverallCond rates the overall condition of the house (1 = Very Poor, 10 = Very Excellent).
# Classical numeric descriptive analysis
describe(OverallCond)
# Histogram and density plots
p1<-ggplot(data,aes(x=OverallCond))+geom_density()+
labs(title = "Density of OverallCond",x="OverallCond",y="Values")
p2<-ggplot(data,aes(x=OverallCond))+geom_histogram()+
labs(title = "Histogram of OverallCond",x="OverallCond",y="Values")
# This function controls the graphical output device
ggarrange(p1,p2, nrow=1, common.legend = FALSE)
# Missing Values
na_counts <- sum(is.na(data$OverallCond))
print(na_counts)
## [1] 0
No missing values.
# Boxplot to visualize outliers
p1 <- ggplot(data, aes(x = OverallCond)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 1, outlier.size = 2) +
coord_flip() +
labs(title = "Boxplot of OverallCond", x = "Values", y = "")
ggarrange(p1, nrow=1, common.legend = FALSE)
# Count outliers
outliers <- boxplot.stats(data$OverallCond)$out
# Display the count of outliers
cat("Number of outliers:", length(outliers), "\n")
## Number of outliers: 106
# Identify rows with outliers
outlier_rows <- which(data$OverallCond %in% outliers)
# Print the indices of rows with outliers
cat("Indices of rows with outliers:", outlier_rows, "\n")
## Indices of rows with outliers: 2 15 24 78 81 84 102 160 165 166 190 208 214 230 256 273 294 298 319 324 326 330 338 342 350 362 369 372 382 391 411 420 428 432 443 452 484 498 506 519 531 545 576 577 594 606 608 632 636 640 645 677 722 723 726 735 740 757 769 789 791 812 819 821 836 841 846 851 859 864 865 886 890 894 920 924 938 964 965 971 978 984 989 992 994 995 1008 1011 1021 1031 1041 1050 1052 1054 1081 1083 1089 1132 1139 1160 1182 1186 1196 1209 1229 1247
There are 106 outliers; since there are enough data records, we will eliminate them.
# Remove rows with outliers
data <- data[-outlier_rows, ]
# Print a message after removing outliers
cat("Outliers removed. New number of rows:", nrow(data), "\n")
## Outliers removed. New number of rows: 1143
OverallCond isn’t a continuous variable, so we skip the normality assessment.
YearBuilt: original construction date.
# Classical numeric descriptive analysis
describe(YearBuilt)
# Histogram and density plots
p1<-ggplot(data,aes(x=YearBuilt))+geom_density()+
labs(title = "Density of YearBuilt",x="YearBuilt",y="Values")
p2<-ggplot(data,aes(x=YearBuilt))+geom_histogram()+
labs(title = "Histogram of YearBuilt",x="YearBuilt",y="Values")
# This function controls the graphical output device
ggarrange(p1,p2, nrow=1, common.legend = FALSE)
# Missing Values
na_counts <- sum(is.na(data$YearBuilt))
print(na_counts)
## [1] 0
print(na_counts/nrow(data))
## [1] 0
0 missing values
# Boxplot to visualize outliers
p1 <- ggplot(data, aes(x = YearBuilt)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 1, outlier.size = 2) +
coord_flip() +
labs(title = "Boxplot of YearBuilt", x = "Values", y = "")
ggarrange(p1, nrow=1, common.legend = FALSE)
# Count outliers
outliers <- boxplot.stats(data$YearBuilt)$out
# Display the count of outliers
cat("Number of outliers:", length(outliers), "\n")
## Number of outliers: 5
# Identify rows with outliers
outlier_rows <- which(data$YearBuilt %in% outliers)
# Print the indices of rows with outliers
cat("Indices of rows with outliers:", outlier_rows, "\n")
## Indices of rows with outliers: 89 499 589 893 1058
There are 5 outliers; since there are enough data records, we will eliminate them.
# Remove rows with outliers
data <- data[-outlier_rows, ]
# Print a message after removing outliers
cat("Outliers removed. New number of rows:", nrow(data), "\n")
## Outliers removed. New number of rows: 1138
YearBuilt isn’t a continuous variable, so we skip the normality assessment.
ExterQual evaluates the quality of the material on the exterior.
# Basic descriptive statistics
freq(ExterQual)
## Frequencies
## ExterQual
## Type: Character
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
## Ex 52 3.56 3.56 3.56 3.56
## Fa 14 0.96 4.52 0.96 4.52
## Gd 488 33.42 37.95 33.42 37.95
## TA 906 62.05 100.00 62.05 100.00
## <NA> 0 0.00 100.00
## Total 1460 100.00 100.00 100.00 100.00
# Pie chart and bar graph
p1<-ggplot(data, aes(x=factor(1), fill = ExterQual))+geom_bar()+
coord_polar("y")+labs(x="ExterQual",y="%")
p2<-ggplot(data, aes(x=factor(1), fill = ExterQual))+geom_bar()+
labs(x="ExterQual",y="%")
# This function controls the graphical output device
ggarrange(p1,p2,nrow = 1,ncol=2, common.legend = TRUE)
There aren’t any missing values, so we will continue with the next variable.
ExterCond evaluates the present condition of the material on the exterior.
# Basic descriptive statistics
freq(ExterCond)
## Frequencies
## ExterCond
## Type: Character
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
## Ex 3 0.205 0.205 0.205 0.205
## Fa 28 1.918 2.123 1.918 2.123
## Gd 146 10.000 12.123 10.000 12.123
## Po 1 0.068 12.192 0.068 12.192
## TA 1282 87.808 100.000 87.808 100.000
## <NA> 0 0.000 100.000
## Total 1460 100.000 100.000 100.000 100.000
# Pie chart and bar graph
p1<-ggplot(data, aes(x=factor(1), fill = ExterCond))+geom_bar()+
coord_polar("y")+labs(x="ExterCond",y="%")
p2<-ggplot(data, aes(x=factor(1), fill = ExterCond))+geom_bar()+
labs(x="ExterCond",y="%")
# This function controls the graphical output device
ggarrange(p1,p2,nrow = 1,ncol=2, common.legend = TRUE)
There aren’t any missing values, so we will continue with the next variable.
GarageQual: garage quality.
# Basic descriptive statistics
freq(GarageQual)
## Frequencies
## GarageQual
## Type: Character
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
## Ex 3 0.22 0.22 0.21 0.21
## Fa 48 3.48 3.70 3.29 3.49
## Gd 14 1.02 4.71 0.96 4.45
## Po 3 0.22 4.93 0.21 4.66
## TA 1311 95.07 100.00 89.79 94.45
## <NA> 81 5.55 100.00
## Total 1460 100.00 100.00 100.00 100.00
# Pie chart and bar graph
p1<-ggplot(data, aes(x=factor(1), fill = GarageQual))+geom_bar()+
coord_polar("y")+labs(x="GarageQual",y="%")
p2<-ggplot(data, aes(x=factor(1), fill = GarageQual))+geom_bar()+
labs(x="GarageQual",y="%")
# This function controls the graphical output device
ggarrange(p1,p2,nrow = 1,ncol=2, common.legend = TRUE)
NA values in this variable mean that there isn’t a garage, so we don’t have to impute those values. In order to correctly apply the classification methods, we transform this variable to a numerical ordering.
# Create a mapping for the transformation
garage_qual_mapping <- c(TA = 3, Po = 1, Gd = 4, Fa = 2, Ex = 5)
# Transform GarageQual variable
data <- data %>%
mutate(GarageQual_numeric = ifelse(is.na(GarageQual), 0, garage_qual_mapping[as.character(GarageQual)]))
# Drop the GarageQual variable
data <- select(data, -GarageQual)
# Check the transformed data
print(data$GarageQual_numeric)
## [1] 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 0 3 3
## [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 3 3 3 3 3 3
## [75] 0 3 3 3 3 3 2 0 3 3 3 3 3 3 3 0 3 3 3 3 3 3 3 3 3 3 3 3 0 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 0 3 3 3 3 2 3 0 2 3 3 3 2 0 3 3 3 3 3 3 0 3 3 3 3 3 3 3 3 3 3 3 2
## [149] 2 3 3 3 2 3 3 3 3 3 0 3 3 3 3 3 3 3 3 3 3 3 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [186] 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 5 3 3 3 3 3 3
## [223] 3 3 3 3 3 3 3 3 0 3 3 3 3 3 3 3 3 3 3 3 3 0 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3
## [260] 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [297] 3 3 3 3 3 3 3 3 3 3 3 3 3 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [334] 3 3 3 3 3 0 3 3 3 4 3 0 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 0 3 3 3 3 3 3 3
## [371] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 3 3 3 3 3 2 3 3 3 3 3 3 3 3 1 3 3 3 3
## [408] 3 3 3 3 3 3 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 3 3
## [445] 3 3 3 3 3 3 3 3 3 3 3 2 3 3 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [482] 3 3 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3
## [519] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [556] 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 3 3 3 3 3 3 3 0 3 3 3 3
## [593] 3 3 5 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 3 3 3 2 3 3 3 3 3 3 3
## [630] 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 3 3 3 3 3 3 3 3 3
## [667] 4 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [704] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 0 3 3 3 3 3 3 3 3 3 3 3
## [741] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 3 3 3 0 3 3 3 3 0 0 3 3 0 3 3 3 3 3 3 3 3
## [778] 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 0 0 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 2
## [815] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [852] 3 3 3 3 3 3 3 3 3 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3
## [889] 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 3 3 2 3 3 3 3 3 3 3 3 2 0 3 3 3 3
## [926] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 3 3 3 3 3 3 3 3 3 3 0
## [963] 2 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 3
## [1000] 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 3 3 3 3 3 3 3 3 3 0 3 0 0
## [1037] 2 3 3 3 2 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [1074] 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 0 3 3 3 3 3 3 3 3 3 3 3 3
## [1111] 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 2 3 3 3 2 0 3 3 3 3 3 3
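The named-vector lookup used above maps each quality label to its ordinal rank (Po = 1 up to Ex = 5), with NA (no garage) mapped to 0. The same pattern on a tiny synthetic vector, to make the indexing explicit:

```r
garage_qual_mapping <- c(Po = 1, Fa = 2, TA = 3, Gd = 4, Ex = 5)

labels <- c("TA", NA, "Ex", "Fa", "TA")
# Indexing a named vector by character performs the lookup;
# ifelse routes the NA positions to 0 instead
numeric_qual <- ifelse(is.na(labels), 0, garage_qual_mapping[labels])

numeric_qual  # 3 0 5 2 3
```

This is equivalent to the `mutate` call in the chunk above, just without the data.frame scaffolding.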
# Classical numeric descriptive analysis
describe(data$GarageQual_numeric)
# Histogram and density plots
p1<-ggplot(data,aes(x=GarageQual_numeric))+geom_density()+
labs(title = "Density of GarageQual_numeric",x="GarageQual_numeric",y="Values")
p2<-ggplot(data,aes(x=GarageQual_numeric))+geom_histogram()+
labs(title = "Histogram of GarageQual_numeric",x="GarageQual_numeric",y="Values")
# This function controls the graphical output device
ggarrange(p1,p2, nrow=1, common.legend = FALSE)
# Boxplot to visualize outliers
p1 <- ggplot(data, aes(x = GarageQual_numeric)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 1, outlier.size = 2) +
coord_flip() +
labs(title = "Boxplot of GarageQual_numeric", x = "Values", y = "")
ggarrange(p1, nrow=1, common.legend = FALSE)
# Count outliers
outliers <- boxplot.stats(data$GarageQual_numeric)$out
# Display the count of outliers
cat("Number of outliers:", length(outliers), "\n")
## Number of outliers: 92
# Identify rows with outliers
outlier_rows <- which(data$GarageQual_numeric %in% outliers)
# Print the indices of rows with outliers
cat("Indices of rows with outliers:", outlier_rows, "\n")
## Indices of rows with outliers: 8 26 35 68 75 81 82 90 103 116 121 123 124 128 129 136 148 149 153 159 171 196 216 231 244 249 262 310 339 343 345 357 363 388 394 403 414 442 456 459 484 502 514 556 580 588 595 618 622 641 657 667 668 721 729 756 760 765 766 769 788 793 794 805 814 861 885 889 908 911 920 921 951 962 963 975 998 1000 1023 1033 1035 1036 1037 1041 1049 1074 1090 1098 1118 1127 1131 1132
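The indices above come from Tukey's 1.5 × IQR fence, which is the rule `boxplot.stats` implements: a value is flagged when it lies below the lower hinge minus 1.5 times the hinge spread, or above the upper hinge plus 1.5 times the hinge spread. A minimal sketch on synthetic data (the vector `x` is illustrative, not from the dataset):

```r
set.seed(1)
x <- c(rnorm(100), 10, -10)               # illustrative data with two extreme values
stats <- boxplot.stats(x)
# stats$stats holds: lower whisker, lower hinge, median, upper hinge, upper whisker
hspread <- diff(stats$stats[c(2, 4)])     # hinge spread (IQR analogue)
lower <- stats$stats[2] - 1.5 * hspread
upper <- stats$stats[4] + 1.5 * hspread
out_manual <- x[x < lower | x > upper]
setequal(out_manual, stats$out)           # same fence as boxplot.stats
```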
There are 92 outliers; since there are enough data records, we will eliminate them.
# Remove rows with outliers
data <- data[-outlier_rows, ]
# Print a message after removing outliers
cat("Outliers removed. New number of rows:", nrow(data), "\n")
## Outliers removed. New number of rows: 1046
print(length(unique(data$GarageQual_numeric)))
## [1] 1
Since removing the outliers leaves a degenerate variable (a single remaining value), we drop it from the data, as it provides no information for classifying the sale price.
# Drop the GarageQual_numeric variable
data <- select(data, -GarageQual_numeric)
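More generally, any column left with a single value after filtering can be dropped the same way; a small sketch (the helper name `drop_constant_cols` is ours, not from the original analysis):

```r
# Drop every column that is left with a single unique non-NA value
drop_constant_cols <- function(df) {
  keep <- vapply(df, function(col) length(unique(na.omit(col))) > 1, logical(1))
  df[, keep, drop = FALSE]
}

d <- data.frame(a = c(1, 2, 3), b = c(3, 3, 3))
names(drop_constant_cols(d))   # only "a" survives
```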
Above grade (ground) living area square feet
# Classical numeric descriptive analysis
describe(GrLivArea)
# Histogram and density plots
p1<-ggplot(data,aes(x=GrLivArea))+geom_density()+
labs(title = "Density of GrLivArea",x="GrLivArea",y="Values")
p2<-ggplot(data,aes(x=GrLivArea))+geom_histogram()+
labs(title = "Histogram of GrLivArea",x="GrLivArea",y="Values")
# This function controls the graphical output device
ggarrange(p1,p2, nrow=1, common.legend = FALSE)
# Missing Values
na_counts <- sum(is.na(data$GrLivArea))
print(na_counts)
## [1] 0
print(na_counts/nrow(data))
## [1] 0
0 missing values
# Boxplot to visualize outliers
p1 <- ggplot(data, aes(x = GrLivArea)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 1, outlier.size = 2) +
coord_flip() +
labs(title = "Boxplot of GrLivArea", x = "Values", y = "")
ggarrange(p1, nrow=1, common.legend = FALSE)
# Count outliers
outliers <- boxplot.stats(data$GrLivArea)$out
# Display the count of outliers
cat("Number of outliers:", length(outliers), "\n")
## Number of outliers: 17
# Identify rows with outliers
outlier_rows <- which(data$GrLivArea %in% outliers)
# Print the indices of rows with outliers
cat("Indices of rows with outliers:", outlier_rows, "\n")
## Indices of rows with outliers: 47 89 230 231 355 377 439 464 582 587 704 743 757 830 848 971 994
There are 17 outliers; since there are enough data records, we will eliminate them.
# Remove rows with outliers
data <- data[-outlier_rows, ]
# Print a message after removing outliers
cat("Outliers removed. New number of rows:", nrow(data), "\n")
## Outliers removed. New number of rows: 1029
# Histogram representation of the column GrLivArea
par(mfcol = c(1, 1))
# Define the column name
j0 <- "GrLivArea"
# Set the sequence for the x-axis
x0 <- seq(min(data[, j0], na.rm = TRUE), max(data[, j0], na.rm = TRUE), length.out = 50)
# Create a histogram for the entire column
hist(data[, j0], prob = TRUE, col = grey(0.8), main = paste("Histogram of", j0), xlab = j0)
# Overlay normal distribution curve
lines(x0, dnorm(x0, mean(data[, j0], na.rm = TRUE), sd(data[, j0], na.rm = TRUE)), col = "blue", lwd = 2)
# Representation of normal quantiles for the column GrLivArea
par(mfrow = c(1, 1))
# Define the column name
j0 <- "GrLivArea"
# Set the sequence for the x-axis
x0 <- seq(min(data[, j0], na.rm = TRUE), max(data[, j0], na.rm = TRUE), length.out = 50)
# Create a quantile-quantile plot
qqnorm(data[, j0], main = paste("Q-Q Plot of", j0), pch = 19, col = 2)
qqline(data[, j0])
This exploratory analysis gives an idea of whether the variable may follow a normal distribution, but it is always better to run the corresponding normality tests.
The null hypothesis that the data follow a univariate normal distribution is tested. This hypothesis is rejected if the p-value given by the Shapiro-Wilk test is less than 0.05. Otherwise the assumption of normality of the data is not rejected.
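The decision rule described above can be sketched in isolation on synthetic data (not the housing variables):

```r
set.seed(42)
normal_sample <- rnorm(200)    # drawn from a normal distribution
skewed_sample <- rexp(200)     # exponential: clearly non-normal
# Normality is rejected when the p-value falls below 0.05
shapiro.test(normal_sample)$p.value > 0.05   # normality typically not rejected
shapiro.test(skewed_sample)$p.value < 0.05   # normality rejected
```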
# Shapiro-Wilk p-value for GrLivArea from the result object of the previous section (before this outlier removal)
print(subset(result, variable == "GrLivArea"))
## variable value
## 7 GrLivArea 1.034275e-16
data_tidy <- melt(data, value.name = "value")
# Use the aggregate function with the correct FUN argument
result <- aggregate(value ~ variable, data = data_tidy,
FUN = function(x) shapiro.test(x)$p.value)
# Print the row corresponding to GrLivArea
print(subset(result, variable == "GrLivArea"))
## variable value
## 7 GrLivArea 1.505932e-11
There is evidence of lack of univariate normality (p-value < 0.05).
First Floor square feet
# Classical numeric descriptive analysis
describe(X1stFlrSF)
# Histogram and density plots
p1<-ggplot(data,aes(x=X1stFlrSF))+geom_density()+
labs(title = "Density of X1stFlrSF",x="X1stFlrSF",y="Values")
p2<-ggplot(data,aes(x=X1stFlrSF))+geom_histogram()+
labs(title = "Histogram of X1stFlrSF",x="X1stFlrSF",y="Values")
# This function controls the graphical output device
ggarrange(p1,p2, nrow=1, common.legend = FALSE)
# Missing Values
na_counts <- sum(is.na(data$X1stFlrSF))
print(na_counts)
## [1] 0
print(na_counts/nrow(data))
## [1] 0
0 missing values
# Boxplot to visualize outliers
p1 <- ggplot(data, aes(x = X1stFlrSF)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 1, outlier.size = 2) +
coord_flip() +
labs(title = "Boxplot of X1stFlrSF", x = "Values", y = "")
ggarrange(p1, nrow=1, common.legend = FALSE)
# Count outliers
outliers <- boxplot.stats(data$X1stFlrSF)$out
# Display the count of outliers
cat("Number of outliers:", length(outliers), "\n")
## Number of outliers: 9
# Identify rows with outliers
outlier_rows <- which(data$X1stFlrSF %in% outliers)
# Print the indices of rows with outliers
cat("Indices of rows with outliers:", outlier_rows, "\n")
## Indices of rows with outliers: 57 127 161 310 642 650 743 853 970
There are 9 outliers; since there are enough data records, we will eliminate them.
# Remove rows with outliers
data <- data[-outlier_rows, ]
# Print a message after removing outliers
cat("Outliers removed. New number of rows:", nrow(data), "\n")
## Outliers removed. New number of rows: 1020
# Histogram representation of the column X1stFlrSF
par(mfcol = c(1, 1))
# Define the column name
j0 <- "X1stFlrSF"
# Set the sequence for the x-axis
x0 <- seq(min(data[, j0], na.rm = TRUE), max(data[, j0], na.rm = TRUE), length.out = 50)
# Create a histogram for the entire column
hist(data[, j0], prob = TRUE, col = grey(0.8), main = paste("Histogram of", j0), xlab = j0)
# Overlay normal distribution curve
lines(x0, dnorm(x0, mean(data[, j0], na.rm = TRUE), sd(data[, j0], na.rm = TRUE)), col = "blue", lwd = 2)
# Representation of normal quantiles for the column X1stFlrSF
par(mfrow = c(1, 1))
# Define the column name
j0 <- "X1stFlrSF"
# Set the sequence for the x-axis
x0 <- seq(min(data[, j0], na.rm = TRUE), max(data[, j0], na.rm = TRUE), length.out = 50)
# Create a quantile-quantile plot
qqnorm(data[, j0], main = paste("Q-Q Plot of", j0), pch = 19, col = 2)
qqline(data[, j0])
This exploratory analysis gives an idea of whether the variable may follow a normal distribution, but it is always better to run the corresponding normality tests.
The null hypothesis that the data follow a univariate normal distribution is tested. This hypothesis is rejected if the p-value given by the Shapiro-Wilk test is less than 0.05. Otherwise the assumption of normality of the data is not rejected.
data_tidy <- melt(data, value.name = "value")
# Use the aggregate function with the correct FUN argument
result <- aggregate(value ~ variable, data = data_tidy,
FUN = function(x) shapiro.test(x)$p.value)
# Print the row corresponding to X1stFlrSF
print(subset(result, variable == "X1stFlrSF"))
## variable value
## 8 X1stFlrSF 2.088141e-15
There is evidence of lack of univariate normality (p-value < 0.05).
Second floor square feet
# Classical numeric descriptive analysis
describe(X2ndFlrSF)
# Histogram and density plots
p1<-ggplot(data,aes(x=X2ndFlrSF))+geom_density()+
labs(title = "Density of X2ndFlrSF",x="X2ndFlrSF",y="Values")
p2<-ggplot(data,aes(x=X2ndFlrSF))+geom_histogram()+
labs(title = "Histogram of X2ndFlrSF",x="X2ndFlrSF",y="Values")
# This function controls the graphical output device
ggarrange(p1,p2, nrow=1, common.legend = FALSE)
# Missing Values
na_counts <- sum(is.na(data$X2ndFlrSF))
print(na_counts)
## [1] 0
print(na_counts/nrow(data))
## [1] 0
0 missing values
# Boxplot to visualize outliers
p1 <- ggplot(data, aes(x = X2ndFlrSF)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 1, outlier.size = 2) +
coord_flip() +
labs(title = "Boxplot of X2ndFlrSF", x = "Values", y = "")
ggarrange(p1, nrow=1, common.legend = FALSE)
# Count outliers
outliers <- boxplot.stats(data$X2ndFlrSF)$out
# Display the count of outliers
cat("Number of outliers:", length(outliers), "\n")
## Number of outliers: 0
# Identify rows with outliers
outlier_rows <- which(data$X2ndFlrSF %in% outliers)
# Print the indices of rows with outliers
cat("Indices of rows with outliers:", outlier_rows, "\n")
## Indices of rows with outliers:
There are 0 outliers.
# Histogram representation of the column X2ndFlrSF
par(mfcol = c(1, 1))
# Define the column name
j0 <- "X2ndFlrSF"
# Set the sequence for the x-axis
x0 <- seq(min(data[, j0], na.rm = TRUE), max(data[, j0], na.rm = TRUE), length.out = 50)
# Create a histogram for the entire column
hist(data[, j0], prob = TRUE, col = grey(0.8), main = paste("Histogram of", j0), xlab = j0)
# Overlay normal distribution curve
lines(x0, dnorm(x0, mean(data[, j0], na.rm = TRUE), sd(data[, j0], na.rm = TRUE)), col = "blue", lwd = 2)
# Representation of normal quantiles for the column X2ndFlrSF
par(mfrow = c(1, 1))
# Define the column name
j0 <- "X2ndFlrSF"
# Set the sequence for the x-axis
x0 <- seq(min(data[, j0], na.rm = TRUE), max(data[, j0], na.rm = TRUE), length.out = 50)
# Create a quantile-quantile plot
qqnorm(data[, j0], main = paste("Q-Q Plot of", j0), pch = 19, col = 2)
qqline(data[, j0])
This exploratory analysis gives an idea of whether the variable may follow a normal distribution, but it is always better to run the corresponding normality tests.
The null hypothesis that the data follow a univariate normal distribution is tested. This hypothesis is rejected if the p-value given by the Shapiro-Wilk test is less than 0.05. Otherwise the assumption of normality of the data is not rejected.
# Melt the data
data_tidy <- melt(data, value.name = "value")
result <- aggregate(value ~ variable, data = data_tidy,
FUN = function(x) shapiro.test(x)$p.value)
# Print the row corresponding to X2ndFlrSF
print(subset(result, variable == "X2ndFlrSF"))
## variable value
## 9 X2ndFlrSF 5.323629e-38
print(result)
## variable value
## 1 MSSubClass 7.504974e-32
## 2 LotFrontage 9.076679e-12
## 3 LotArea 2.162218e-04
## 4 OverallQual 1.520600e-19
## 5 OverallCond 7.358633e-38
## 6 YearBuilt 1.316966e-24
## 7 GrLivArea 2.963722e-11
## 8 X1stFlrSF 2.088141e-15
## 9 X2ndFlrSF 5.323629e-38
## 10 YrSold 1.243031e-25
## 11 SalePrice 5.007286e-22
There is evidence of lack of univariate normality (p-value < 0.05).
Year Sold (YYYY)
# Classical numeric descriptive analysis
describe(YrSold)
# Histogram and density plots
p1<-ggplot(data,aes(x=YrSold))+geom_density()+
labs(title = "Density of YrSold",x="YrSold",y="Values")
p2<-ggplot(data,aes(x=YrSold))+geom_histogram()+
labs(title = "Histogram of YrSold",x="YrSold",y="Values")
# This function controls the graphical output device
ggarrange(p1,p2, nrow=1, common.legend = FALSE)
# Missing Values
na_counts <- sum(is.na(data$YrSold))
print(na_counts)
## [1] 0
print(na_counts/nrow(data))
## [1] 0
0 missing values
# Boxplot to visualize outliers
p1 <- ggplot(data, aes(x = YrSold)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 1, outlier.size = 2) +
coord_flip() +
labs(title = "Boxplot of YrSold", x = "Values", y = "")
ggarrange(p1, nrow=1, common.legend = FALSE)
# Count outliers
outliers <- boxplot.stats(data$YrSold)$out
# Display the count of outliers
cat("Number of outliers:", length(outliers), "\n")
## Number of outliers: 0
# Identify rows with outliers
outlier_rows <- which(data$YrSold %in% outliers)
# Print the indices of rows with outliers
cat("Indices of rows with outliers:", outlier_rows, "\n")
## Indices of rows with outliers:
There are 0 outliers.
YrSold is not a continuous variable, so we skip the normality assessment.
Type of sale
# Basic descriptive statistics
freq(SaleType)
## Frequencies
## SaleType
## Type: Character
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
## COD 43 2.95 2.95 2.95 2.95
## Con 2 0.14 3.08 0.14 3.08
## ConLD 9 0.62 3.70 0.62 3.70
## ConLI 5 0.34 4.04 0.34 4.04
## ConLw 5 0.34 4.38 0.34 4.38
## CWD 4 0.27 4.66 0.27 4.66
## New 122 8.36 13.01 8.36 13.01
## Oth 3 0.21 13.22 0.21 13.22
## WD 1267 86.78 100.00 86.78 100.00
## <NA> 0 0.00 100.00
## Total 1460 100.00 100.00 100.00 100.00
# Pie chart and bar graph
p1<-ggplot(data, aes(x=factor(1), fill = SaleType))+geom_bar()+
coord_polar("y")+labs(x="SaleType",y="%")
p2<-ggplot(data, aes(x=factor(1), fill = SaleType))+geom_bar()+
labs(x="SaleType",y="%")
# This function controls the graphical output device
ggarrange(p1,p2,nrow = 1,ncol=2, common.legend = TRUE)
There aren’t any missing values, so we will continue with the next variable.
Condition of sale
# Basic descriptive statistics
freq(SaleCondition)
## Frequencies
## SaleCondition
## Type: Character
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ------------- ------ --------- -------------- --------- --------------
## Abnorml 101 6.92 6.92 6.92 6.92
## AdjLand 4 0.27 7.19 0.27 7.19
## Alloca 12 0.82 8.01 0.82 8.01
## Family 20 1.37 9.38 1.37 9.38
## Normal 1198 82.05 91.44 82.05 91.44
## Partial 125 8.56 100.00 8.56 100.00
## <NA> 0 0.00 100.00
## Total 1460 100.00 100.00 100.00 100.00
# Pie chart and bar graph
p1<-ggplot(data, aes(x=factor(1), fill = SaleCondition))+geom_bar()+
coord_polar("y")+labs(x="SaleCondition",y="%")
p2<-ggplot(data, aes(x=factor(1), fill = SaleCondition))+geom_bar()+
labs(x="SaleCondition",y="%")
# This function controls the graphical output device
ggarrange(p1,p2,nrow = 1,ncol=2, common.legend = TRUE)
There aren’t any missing values, so we will continue with the next variable.
Sale price of the property
# Classical numeric descriptive analysis
describe(SalePrice)
# Histogram and density plots
p1<-ggplot(data,aes(x=SalePrice))+geom_density()+
labs(title = "Density of SalePrice",x="SalePrice",y="Values")
p2<-ggplot(data,aes(x=SalePrice))+geom_histogram()+
labs(title = "Histogram of SalePrice",x="SalePrice",y="Values")
# This function controls the graphical output device
ggarrange(p1,p2, nrow=1, common.legend = FALSE)
We can apply a log transformation to correct the skewness of this variable.
# Classical numeric descriptive analysis
describe(log(SalePrice))
# Histogram and density plots
p1<-ggplot(data,aes(x=log(SalePrice)))+geom_density()+
labs(title = "Density of log(SalePrice)",x="log(SalePrice)",y="Values")
p2<-ggplot(data,aes(x=log(SalePrice)))+geom_histogram()+
labs(title = "Histogram of log(SalePrice)",x="log(SalePrice)",y="Values")
# This function controls the graphical output device
ggarrange(p1,p2, nrow=1, common.legend = FALSE)
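The effect of the log transform can be quantified with the sample skewness; a sketch using the moment-based formula on synthetic lognormal-like data (no extra packages assumed):

```r
# Sample skewness: third standardized central moment
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(7)
price_like <- exp(rnorm(1000, mean = 12, sd = 0.4))  # right-skewed, lognormal-like
skewness(price_like)        # clearly positive
skewness(log(price_like))   # close to 0 after the transform
```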
We will split this target variable into two categories (intervals) of equal frequency.
# Split the target variable into two categories of equal frequency
data$PriceCategory <- ntile(data$SalePrice, 2)
# Name each category
data$PriceCategory <- ifelse(data$PriceCategory == 1, "Cheap", "Expensive")
# Numeric descriptive analysis and graphics
describe(data$SalePrice)
p1 <- ggplot(data, aes(x = SalePrice)) +
geom_density() +
labs(title = "Density of SalePrice", x = "SalePrice", y = "Density")
p2 <- ggplot(data, aes(x = SalePrice)) +
geom_histogram(bins = 30) +
labs(title = "Histogram of SalePrice", x = "SalePrice", y = "Count")
ggarrange(p1, p2, nrow = 1, common.legend = FALSE)
# Verify the new variable
table(data$PriceCategory)
##
## Cheap Expensive
## 510 510
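For a two-way split, ntile(x, 2) is equivalent to cutting at the median rank, which is why both categories end up with the same frequency; a quick sketch on an illustrative vector:

```r
library(dplyr)

x <- c(5, 1, 9, 3, 7, 2, 8, 4)       # illustrative values
g <- ntile(x, 2)                     # 1 = lower half, 2 = upper half
table(g)                             # equal frequencies: 4 and 4
# The upper group is exactly the half with the largest ranks
all((g == 2) == (rank(x) > length(x) / 2))
```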
# Drop the SalePrice variable
data <- select(data, -SalePrice)
# To verify that the column has been removed, you can view the structure of the dataframe
str(data)
## 'data.frame': 1020 obs. of 16 variables:
## $ MSSubClass : int 60 60 70 60 50 20 60 20 60 20 ...
## $ MSZoning : chr "RL" "RL" "RL" "RL" ...
## $ LotFrontage : num 65 68 60 84 85 ...
## $ LotArea : int 8450 11250 9550 14260 14115 10084 10382 11200 11924 12968 ...
## $ OverallQual : int 7 7 7 8 5 8 7 5 9 5 ...
## $ OverallCond : int 5 5 5 5 5 5 6 5 5 6 ...
## $ YearBuilt : int 2003 2001 1915 2000 1993 2004 1973 1965 2005 1962 ...
## $ ExterQual : chr "Gd" "Gd" "TA" "Gd" ...
## $ ExterCond : chr "TA" "TA" "TA" "TA" ...
## $ GrLivArea : int 1710 1786 1717 2198 1362 1694 2090 1040 2324 912 ...
## $ X1stFlrSF : int 856 920 961 1145 796 1694 1107 1040 1182 912 ...
## $ X2ndFlrSF : int 854 866 756 1053 566 0 983 0 1142 0 ...
## $ YrSold : int 2008 2008 2006 2008 2009 2007 2009 2008 2006 2008 ...
## $ SaleType : chr "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr "Normal" "Normal" "Abnorml" "Normal" ...
## $ PriceCategory: chr "Expensive" "Expensive" "Cheap" "Expensive" ...
# Basic descriptive statistics
freq(data$PriceCategory)
## Frequencies
## data$PriceCategory
## Type: Character
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## --------------- ------ --------- -------------- --------- --------------
## Cheap 510 50.00 50.00 50.00 50.00
## Expensive 510 50.00 100.00 50.00 100.00
## <NA> 0 0.00 100.00
## Total 1020 100.00 100.00 100.00 100.00
# Pie chart and bar graph
p1<-ggplot(data, aes(x=factor(1), fill = PriceCategory))+geom_bar()+
coord_polar("y")+labs(x="PriceCategory",y="%")
p2<-ggplot(data, aes(x=factor(1), fill = PriceCategory))+geom_bar()+
labs(x="PriceCategory",y="%")
# This function controls the graphical output device
ggarrange(p1,p2,nrow = 1,ncol=2, common.legend = TRUE)
There aren’t any missing values.
We have finished our univariate exploratory analysis and continue with the multivariate exploratory analysis.
First, we check the assumption of correlation between variables with the Bartlett test; this was done in previous sections.
# Parameters 'scale' and 'center' are set to TRUE to consider standardized data
PCA<-prcomp(numeric_data, scale=T, center = T)
# The field 'rotation' of the 'PCA' object is a matrix
# Its columns are the coefficients of the principal components
# Indicates the weight of each variable in the corresponding principal component
PCA$rotation
## PC1 PC2 PC3 PC4 PC5
## MSSubClass -0.04939892 -0.612183913 0.08492296 -0.001724381 -0.23614639
## LotFrontage -0.29832446 0.377582778 -0.23284795 0.033120106 0.22738320
## LotArea -0.28555488 0.410844194 -0.31175988 -0.063712254 0.08520617
## OverallQual -0.46510644 -0.113538358 0.17329696 0.055240588 -0.23347625
## OverallCond 0.26395972 0.172250807 -0.32896903 0.109053081 -0.69413470
## YearBuilt -0.39782368 -0.114372813 0.34625579 0.022633337 0.22673253
## GrLivArea -0.47599950 -0.131942421 -0.27965660 0.019033090 -0.26227163
## X1stFlrSF -0.31117519 0.315849190 0.40163925 0.012687484 -0.45923559
## X2ndFlrSF -0.24327394 -0.375543201 -0.58613095 0.009858948 0.09705571
## YrSold 0.01305882 0.004949914 0.01260068 0.989325042 0.09179512
## PC6 PC7 PC8 PC9 PC10
## MSSubClass -0.109772774 -0.50699006 0.53704011 0.048050668 8.229278e-04
## LotFrontage 0.268123202 -0.75559624 -0.12374617 0.057269306 -2.630894e-03
## LotArea -0.188615340 0.25960338 0.72529456 0.110136422 6.402832e-05
## OverallQual 0.364478031 0.20452667 -0.09771754 0.706374719 1.082659e-03
## OverallCond 0.495443825 0.02190150 0.10787432 -0.202821254 -1.074622e-03
## YearBuilt 0.508924782 0.16788156 0.19217872 -0.576707138 -1.139130e-02
## GrLivArea -0.305290803 0.04782402 -0.25957016 -0.245471309 -6.209161e-01
## X1stFlrSF -0.374829381 -0.11352645 -0.10116179 -0.193892783 4.793991e-01
## X2ndFlrSF -0.005033558 0.13586244 -0.17913630 -0.107910847 6.200807e-01
## YrSold -0.097190601 0.02263710 0.04903243 0.009652686 3.040544e-05
# Standard deviations of each principal component
PCA$sdev
## [1] 1.74325300 1.33490571 1.24615986 1.00332182 0.90133109 0.77844570
## [7] 0.72139346 0.66261791 0.48872537 0.05317622
Each principal component is simply a linear combination of all the variables, with coefficients given by the corresponding column of the rotation matrix.
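This can be checked directly: the scores in PCA$x equal the standardized data multiplied by the rotation matrix. A sketch on the built-in iris data, since numeric_data itself is not reproduced here:

```r
X <- as.matrix(iris[, 1:4])
pca <- prcomp(X, scale. = TRUE, center = TRUE)
# Scores = standardized data %*% rotation matrix
scores_manual <- scale(X) %*% pca$rotation
max(abs(scores_manual - pca$x))   # numerically zero
```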
# The function 'summary' applied to the 'PCA' object provides relevant information
# - Standard deviations of each principal component
# - Proportion of variance explained and cumulative variance
summary(PCA)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7433 1.3349 1.2462 1.0033 0.90133 0.7784 0.72139
## Proportion of Variance 0.3039 0.1782 0.1553 0.1007 0.08124 0.0606 0.05204
## Cumulative Proportion 0.3039 0.4821 0.6374 0.7380 0.81929 0.8799 0.93193
## PC8 PC9 PC10
## Standard deviation 0.66262 0.48873 0.05318
## Proportion of Variance 0.04391 0.02389 0.00028
## Cumulative Proportion 0.97583 0.99972 1.00000
# The following graph shows the proportion of explained variance
Explained_variance <- PCA$sdev^2 / sum(PCA$sdev^2)
p1<-ggplot(data = data.frame(Explained_variance, pc = 1:10),
aes(x = pc, y = Explained_variance, fill=Explained_variance )) +
geom_col(width = 0.3) + scale_y_continuous(limits = c(0,0.6)) + theme_bw() +
labs(x = "Principal component", y= "Proportion of variance")
# The following graph shows the proportion of cumulative explained variance
Cumulative_variance<-cumsum(Explained_variance)
p2<-ggplot(data = data.frame(Cumulative_variance, pc = 1:10),
aes(x = pc, y = Cumulative_variance, fill = Cumulative_variance)) +
geom_col(width = 0.5) + scale_y_continuous(limits = c(0,1)) + theme_bw() +
labs(x = "Principal component",
y = "Proportion of cumulative variance")
p1
p2
fviz_eig(PCA)
Applying the rule of Abdi et al. (retain the components whose variance exceeds the average variance), four principal components are kept, as can be deduced from the following code chunk.
#######################
# Rule of Abdi et al. #
#######################
# Variances
PCA$sdev^2
## [1] 3.03893101 1.78197326 1.55291439 1.00665468 0.81239774 0.60597771
## [7] 0.52040853 0.43906249 0.23885248 0.00282771
# Average of variances
mean(PCA$sdev^2)
## [1] 1
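The rule keeps the components whose variance exceeds this average; counting them in code gives the number of retained components. A sketch on iris (where the rule keeps a single component; applied to the variances listed above, it keeps four):

```r
pca <- prcomp(as.matrix(iris[, 1:4]), scale. = TRUE)
variances <- pca$sdev^2
# Abdi-style rule: retain components whose variance exceeds the average
# (the average is 1 for standardized data)
sum(variances > mean(variances))
```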
# These graphical outputs show the projection of the variables in two dimensions
# Display the weight of the variable in the direction of the principal component
# Projection of variables on the first and second principal components
p1 <- fviz_pca_var(PCA, axes = c(1, 2), repel = TRUE, col.var = "cos2",
legend.title = "Distance", title = "Variables (PC1 vs PC2)") + theme_bw()
# Projection on the first and third principal components
p2 <- fviz_pca_var(PCA, axes = c(1, 3), repel = TRUE, col.var = "cos2",
legend.title = "Distance", title = "Variables (PC1 vs PC3)") + theme_bw()
# Projection on the first and fourth principal components
p3 <- fviz_pca_var(PCA, axes = c(1, 4), repel = TRUE, col.var = "cos2",
legend.title = "Distance", title = "Variables (PC1 vs PC4)") + theme_bw()
# Projection on the second and third principal components
p4 <- fviz_pca_var(PCA, axes = c(2, 3), repel = TRUE, col.var = "cos2",
legend.title = "Distance", title = "Variables (PC2 vs PC3)") + theme_bw()
# Projection on the second and fourth principal components
p5 <- fviz_pca_var(PCA, axes = c(2, 4), repel = TRUE, col.var = "cos2",
legend.title = "Distance", title = "Variables (PC2 vs PC4)") + theme_bw()
# Projection on the third and fourth principal components
p6 <- fviz_pca_var(PCA, axes = c(3, 4), repel = TRUE, col.var = "cos2",
legend.title = "Distance", title = "Variables (PC3 vs PC4)") + theme_bw()
# Displaying graphics
p1
p2
p3
p4
p5
p6
# It is also possible to represent the observations
# As well as identify with colors those observations that explain the greatest
# variance of the principal components
p1<-fviz_pca_ind(PCA,axes = c(1,2),col.ind = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel=TRUE,legend.title="Contrib.var", title="Records")+theme_bw()
p2<-fviz_pca_ind(PCA,axes=c(1,3),col.ind = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel=TRUE,legend.title="Contrib.var", title="Records")+theme_bw()
p3<-fviz_pca_ind(PCA,axes=c(1,4),col.ind = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel=TRUE,legend.title="Contrib.var", title="Records")+theme_bw()
p4<-fviz_pca_ind(PCA,axes=c(2,3),col.ind = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel=TRUE,legend.title="Contrib.var", title="Records")+theme_bw()
p5<-fviz_pca_ind(PCA,axes=c(2,4),col.ind = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel=TRUE,legend.title="Contrib.var", title="Records")+theme_bw()
p6<-fviz_pca_ind(PCA,axes=c(3,4),col.ind = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel=TRUE,legend.title="Contrib.var", title="Records")+theme_bw()
# Displaying graphics
p1
p2
p3
p4
p5
p6
Since there are so many observations, the previous plots are cluttered and not very clear.
# Joint representation of variables and observations
# Relates the possible relationships between the contributions of the records
# to the variances of the components and the weight of the variables in each
# principal component
p1<-fviz_pca(PCA,axes=c(1,2),alpha.ind ="contrib", col.var = "cos2",
col.ind="seagreen",
gradient.cols = c("#FDF50E", "#FD960E", "#FD1E0E"),
repel=TRUE, legend.title="Distance")+theme_bw()
p2<-fviz_pca(PCA,axes=c(1,3),alpha.ind ="contrib",
col.var = "cos2",col.ind="seagreen",
gradient.cols = c("#FDF50E", "#FD960E", "#FD1E0E"),
repel=TRUE, legend.title="Distance")+theme_bw()
p3<-fviz_pca(PCA,axes=c(1,4),alpha.ind ="contrib",
col.var = "cos2",col.ind="seagreen",
gradient.cols = c("#FDF50E", "#FD960E", "#FD1E0E"),
repel=TRUE, legend.title="Distance")+theme_bw()
p4<-fviz_pca(PCA,axes=c(2,3),alpha.ind ="contrib",
col.var = "cos2",col.ind="seagreen",
gradient.cols = c("#FDF50E", "#FD960E", "#FD1E0E"),
repel=TRUE, legend.title="Distance")+theme_bw()
p5<-fviz_pca(PCA,axes=c(2,4),alpha.ind ="contrib",
col.var = "cos2",col.ind="seagreen",
gradient.cols = c("#FDF50E", "#FD960E", "#FD1E0E"),
repel=TRUE, legend.title="Distance")+theme_bw()
p6<-fviz_pca(PCA,axes=c(3,4),alpha.ind ="contrib",
col.var = "cos2",col.ind="seagreen",
gradient.cols = c("#FDF50E", "#FD960E", "#FD1E0E"),
repel=TRUE, legend.title="Distance")+theme_bw()
# Displaying graphics
p1
p2
p3
p4
p5
p6
Finally, since the object of this part was to reduce the dimension of the data set, it is possible to obtain the coordinates of the original data in the new reference system.
head(PCA$x)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## 1 -0.8452027 -1.6738113 -0.4179040 0.1597390 0.7328336 0.53822810 0.6112833
## 3 -1.3295188 -1.1303234 -0.8048391 0.1047746 0.7172377 0.23081222 0.7016935
## 4 0.3536070 -1.2804509 -1.3250047 -1.4367515 -0.4417883 -1.30933047 0.1670796
## 5 -3.1108652 -0.4084764 -1.5479228 0.1434477 0.3476182 0.01827479 0.3296982
## 6 -0.3612626 0.4427567 -1.2539408 0.6975673 1.8817162 0.05321400 -0.1994333
## 7 -1.8359116 1.1176454 1.4965455 -0.5418138 -0.3665633 0.17510923 0.4713932
## PC8 PC9 PC10
## 1 -0.2628569 -0.13551171 -3.170197e-03
## 3 0.3812697 -0.05142864 -2.022434e-03
## 4 -0.4464533 1.73267727 3.571344e-02
## 5 0.5714314 0.27611780 8.193545e-05
## 6 1.3914121 -0.38731959 -6.923908e-03
## 7 -0.5925704 0.14855231 -2.013263e-03
There are different criteria for choosing the number of factors, among which the scree plot (Cattell 1966) and parallel analysis (Horn 1965) stand out. According to the following graphical outputs, 3 is considered the optimal number of factors (parallel analysis).
# Scree plot
scree(poly_cor)
#Parallel analysis
fa.parallel(poly_cor,n.obs=100,fa="fa",fm="ml")
## Parallel analysis suggests that the number of factors = 3 and the number of components = NA
A factor model with 3 factors is fitted, applying a varimax rotation to seek a simpler interpretation.
# Note: fa() selects the estimation method via the 'fm' argument; the default
# "minres" is used here, which is why the factors below are labelled MR1-MR3
modelo_varimax<-fa(poly_cor,nfactors = 3,rotate = "varimax")
# The rotated factorial matrix is shown
print(modelo_varimax$loadings,cut=0)
##
## Loadings:
## MR1 MR2 MR3
## MSSubClass 0.221 0.245 -0.581
## LotFrontage 0.159 0.196 0.589
## LotArea 0.078 0.243 0.713
## OverallQual 0.773 0.284 0.104
## OverallCond -0.458 -0.072 0.066
## YearBuilt 0.739 0.108 0.016
## GrLivArea 0.470 0.699 0.266
## X1stFlrSF 0.599 -0.214 0.401
## X2ndFlrSF -0.059 1.019 -0.086
## YrSold -0.006 -0.014 -0.009
##
## MR1 MR2 MR3
## SS loadings 2.016 1.828 1.447
## Proportion Var 0.202 0.183 0.145
## Cumulative Var 0.202 0.384 0.529
We could visually inspect which variables correlate with each factor, but this is tedious, so we use the following representation in diagram mode instead.
fa.diagram(modelo_varimax)
The factanal function provides another way to perform the previous analysis.
# This function only implements the maximum likelihood method
FA<-factanal(numeric_data,factors=3, rotation="varimax")
FA$loadings
##
## Loadings:
## Factor1 Factor2 Factor3
## MSSubClass 0.125 0.282
## LotFrontage 0.202 0.310
## LotArea 0.122 0.389
## OverallQual 0.664 0.400
## OverallCond -0.465
## YearBuilt 0.994
## GrLivArea 0.309 0.888 0.333
## X1stFlrSF 0.404 0.565 -0.717
## X2ndFlrSF 0.453 0.889
## YrSold
##
## Factor1 Factor2 Factor3
## SS loadings 1.976 1.729 1.507
## Proportion Var 0.198 0.173 0.151
## Cumulative Var 0.198 0.370 0.521
# Royston multivariate normality test
royston_test <- MVN::mvn(data = numeric_data, mvnTest = "royston", multivariatePlot = "qq")
royston_test$multivariateNormality
# Henze-Zirkler multivariate normality test
hz_test <- MVN::mvn(data = numeric_data, mvnTest = "hz")
hz_test$multivariateNormality
We find evidence of a lack of multivariate normality at the 5% level.
We will partition the dataset into a training set (80% of the data), on which we perform the linear discriminant analysis, and a test set (20%), with which we validate the model.
# The response variable has to be an object of the class 'factor' of R language
data$PriceCategory<-as.factor(data$PriceCategory)
set.seed(3)
# Partitioning the dataset: training (80%) + test (20%)
trainIndex<-createDataPartition(data$PriceCategory,p=0.80)$Resample
datos_train<-data[trainIndex,]
datos_test<-data[-trainIndex,]
datos_train
datos_test
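createDataPartition draws a stratified sample, so both partitions preserve the 50/50 class balance; a sketch on synthetic labels (assuming the caret package is available):

```r
library(caret)

set.seed(3)
y <- factor(rep(c("Cheap", "Expensive"), each = 100))
idx <- createDataPartition(y, p = 0.80)$Resample1
length(idx)      # 160 rows in the training partition
table(y[idx])    # 80 of each class: the balance is preserved
```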
The lda function of the MASS package fits this model.
modelo_lda <- lda(formula = PriceCategory ~ LotFrontage + MSSubClass + MSZoning +
ExterQual + ExterCond + SaleType + SaleCondition + LotArea + OverallQual +
OverallCond + YearBuilt + GrLivArea + X1stFlrSF + X2ndFlrSF + YrSold,
data = datos_train)
modelo_lda
## Call:
## lda(PriceCategory ~ LotFrontage + MSSubClass + MSZoning + ExterQual +
## ExterCond + SaleType + SaleCondition + LotArea + OverallQual +
## OverallCond + YearBuilt + GrLivArea + X1stFlrSF + X2ndFlrSF +
## YrSold, data = datos_train)
##
## Prior probabilities of groups:
## Cheap Expensive
## 0.5 0.5
##
## Group means:
## LotFrontage MSSubClass MSZoningFV MSZoningRH MSZoningRL MSZoningRM
## Cheap 67.79471 46.06618 0.00000000 0.01715686 0.7696078 0.20098039
## Expensive 73.35384 53.99510 0.08823529 0.00245098 0.8946078 0.01470588
## ExterQualFa ExterQualGd ExterQualTA ExterCondGd ExterCondTA
## Cheap 0.009803922 0.07843137 0.9117647 0.10784314 0.8750000
## Expensive 0.000000000 0.68137255 0.2622549 0.04656863 0.9534314
## SaleTypeCon SaleTypeConLD SaleTypeConLI SaleTypeConLw SaleTypeCWD
## Cheap 0.000000000 0.002450980 0.002450980 0.00245098 0.004901961
## Expensive 0.004901961 0.004901961 0.004901961 0.00245098 0.002450980
## SaleTypeNew SaleTypeOth SaleTypeWD SaleConditionAdjLand
## Cheap 0.01960784 0.00245098 0.9142157 0.00245098
## Expensive 0.16421569 0.00000000 0.8039216 0.00000000
## SaleConditionAlloca SaleConditionFamily SaleConditionNormal
## Cheap 0.009803922 0.022058824 0.8553922
## Expensive 0.002450980 0.004901961 0.7916667
## SaleConditionPartial LotArea OverallQual OverallCond YearBuilt
## Cheap 0.01960784 8880.846 5.276961 5.664216 1959.056
## Expensive 0.16666667 10205.684 7.132353 5.235294 1995.265
## GrLivArea X1stFlrSF X2ndFlrSF YrSold
## Cheap 1234.218 1054.777 172.2819 2007.782
## Expensive 1758.924 1298.103 459.8799 2007.811
##
## Coefficients of linear discriminants:
## LD1
## LotFrontage -1.342848e-02
## MSSubClass -2.456770e-03
## MSZoningFV -3.106813e-01
## MSZoningRH -1.301194e+00
## MSZoningRL -8.116036e-01
## MSZoningRM -1.177599e+00
## ExterQualFa 6.290882e-01
## ExterQualGd 6.985497e-01
## ExterQualTA -6.506843e-02
## ExterCondGd 1.856076e-01
## ExterCondTA 2.793930e-01
## SaleTypeCon 8.240608e-01
## SaleTypeConLD -7.382923e-02
## SaleTypeConLI -2.245549e-01
## SaleTypeConLw 4.856318e-01
## SaleTypeCWD -7.588695e-01
## SaleTypeNew -4.715859e-01
## SaleTypeOth 6.837812e-01
## SaleTypeWD 4.329793e-01
## SaleConditionAdjLand -3.115268e-02
## SaleConditionAlloca -5.066976e-01
## SaleConditionFamily -3.962297e-01
## SaleConditionNormal 7.713446e-02
## SaleConditionPartial 8.615625e-01
## LotArea 3.085357e-05
## OverallQual 3.331124e-01
## OverallCond 1.781541e-01
## YearBuilt 2.393942e-02
## GrLivArea 1.358962e-03
## X1stFlrSF 5.693947e-04
## X2ndFlrSF 3.646943e-04
## YrSold 8.389040e-03
nuevas_observaciones <- datos_test
predict(object = modelo_lda, newdata = nuevas_observaciones)
## $class
## [1] Expensive Cheap Cheap Cheap Cheap Cheap Expensive
## [8] Cheap Cheap Cheap Cheap Cheap Expensive Cheap
## [15] Expensive Cheap Expensive Cheap Cheap Expensive Expensive
## [22] Expensive Cheap Cheap Expensive Expensive Cheap Cheap
## [29] Expensive Expensive Cheap Expensive Cheap Cheap Expensive
## [36] Expensive Expensive Expensive Cheap Expensive Cheap Cheap
## [43] Cheap Cheap Cheap Cheap Expensive Cheap Expensive
## [50] Expensive Expensive Cheap Cheap Cheap Cheap Expensive
## [57] Expensive Cheap Expensive Expensive Expensive Cheap Expensive
## [64] Expensive Cheap Cheap Cheap Expensive Expensive Expensive
## [71] Cheap Expensive Cheap Expensive Cheap Expensive Cheap
## [78] Expensive Cheap Cheap Cheap Cheap Cheap Expensive
## [85] Cheap Cheap Cheap Cheap Expensive Cheap Expensive
## [92] Cheap Cheap Expensive Expensive Cheap Cheap Cheap
## [99] Expensive Cheap Expensive Expensive Cheap Cheap Expensive
## [106] Cheap Cheap Cheap Expensive Expensive Cheap Expensive
## [113] Cheap Expensive Expensive Cheap Expensive Expensive Expensive
## [120] Cheap Cheap Expensive Expensive Cheap Expensive Cheap
## [127] Expensive Cheap Cheap Cheap Expensive Expensive Expensive
## [134] Cheap Expensive Cheap Cheap Cheap Cheap Expensive
## [141] Cheap Expensive Expensive Cheap Expensive Cheap Expensive
## [148] Expensive Cheap Expensive Cheap Cheap Expensive Expensive
## [155] Expensive Expensive Expensive Expensive Cheap Cheap Expensive
## [162] Cheap Cheap Cheap Expensive Cheap Expensive Cheap
## [169] Expensive Expensive Expensive Cheap Expensive Cheap Cheap
## [176] Cheap Expensive Cheap Cheap Cheap Cheap Cheap
## [183] Cheap Cheap Expensive Cheap Expensive Expensive Expensive
## [190] Expensive Expensive Expensive Expensive Expensive Expensive Cheap
## [197] Cheap Expensive Cheap Cheap Cheap Cheap Expensive
## [204] Cheap
## Levels: Cheap Expensive
##
## $posterior
## Cheap Expensive
## 14 3.081527e-02 9.691847e-01
## 29 6.125823e-01 3.874177e-01
## 41 9.663121e-01 3.368793e-02
## 45 9.889462e-01 1.105377e-02
## 53 9.994850e-01 5.150359e-04
## 55 9.852142e-01 1.478584e-02
## 58 1.377039e-02 9.862296e-01
## 72 9.960880e-01 3.911951e-03
## 74 9.930967e-01 6.903331e-03
## 77 9.981125e-01 1.887512e-03
## 78 9.960688e-01 3.931155e-03
## 84 9.990521e-01 9.478909e-04
## 85 3.902191e-01 6.097809e-01
## 98 9.984298e-01 1.570150e-03
## 102 7.897845e-02 9.210215e-01
## 103 9.858893e-01 1.411074e-02
## 104 4.463689e-01 5.536311e-01
## 108 9.973503e-01 2.649738e-03
## 111 8.324230e-01 1.675770e-01
## 131 6.957034e-02 9.304297e-01
## 134 2.334591e-02 9.766541e-01
## 142 3.514416e-03 9.964856e-01
## 143 9.972669e-01 2.733086e-03
## 157 9.925769e-01 7.423125e-03
## 163 2.420339e-02 9.757966e-01
## 168 4.085761e-04 9.995914e-01
## 187 6.936963e-01 3.063037e-01
## 189 9.943953e-01 5.604674e-03
## 193 2.280750e-02 9.771925e-01
## 200 1.665426e-03 9.983346e-01
## 205 9.946169e-01 5.383127e-03
## 213 9.209495e-03 9.907905e-01
## 214 8.632973e-01 1.367027e-01
## 216 9.827641e-01 1.723593e-02
## 220 7.952515e-02 9.204748e-01
## 221 1.797721e-02 9.820228e-01
## 239 2.602022e-02 9.739798e-01
## 241 6.632742e-04 9.993367e-01
## 260 9.993340e-01 6.659559e-04
## 271 3.939904e-04 9.996060e-01
## 274 7.425348e-01 2.574652e-01
## 275 9.881600e-01 1.184004e-02
## 276 5.485902e-01 4.514098e-01
## 289 9.974449e-01 2.555118e-03
## 290 9.931615e-01 6.838546e-03
## 295 8.651106e-01 1.348894e-01
## 299 1.784830e-01 8.215170e-01
## 304 9.896898e-01 1.031022e-02
## 317 9.246537e-04 9.990753e-01
## 337 1.882270e-04 9.998118e-01
## 339 1.333471e-02 9.866653e-01
## 353 9.969227e-01 3.077311e-03
## 356 5.873161e-01 4.126839e-01
## 362 9.919363e-01 8.063747e-03
## 373 9.883371e-01 1.166288e-02
## 375 2.759836e-03 9.972402e-01
## 386 1.929004e-02 9.807100e-01
## 397 9.942107e-01 5.789256e-03
## 410 1.850165e-04 9.998150e-01
## 415 1.216942e-03 9.987831e-01
## 424 4.868152e-05 9.999513e-01
## 425 9.807072e-01 1.929276e-02
## 436 2.863908e-03 9.971361e-01
## 445 8.006139e-03 9.919939e-01
## 446 9.137502e-01 8.624979e-02
## 450 9.995545e-01 4.455108e-04
## 467 8.992516e-01 1.007484e-01
## 470 5.653525e-02 9.434647e-01
## 471 4.386637e-02 9.561336e-01
## 472 1.140252e-01 8.859748e-01
## 476 9.961019e-01 3.898109e-03
## 479 1.345439e-03 9.986546e-01
## 486 9.403224e-01 5.967758e-02
## 516 5.228055e-04 9.994772e-01
## 523 9.158069e-01 8.419308e-02
## 526 8.730629e-03 9.912694e-01
## 533 9.964766e-01 3.523356e-03
## 542 2.069119e-04 9.997931e-01
## 544 9.724435e-01 2.755654e-02
## 549 9.985283e-01 1.471736e-03
## 551 9.896630e-01 1.033697e-02
## 561 9.514591e-01 4.854092e-02
## 562 9.378913e-01 6.210867e-02
## 573 8.760233e-02 9.123977e-01
## 577 9.093343e-01 9.066575e-02
## 587 9.991866e-01 8.134493e-04
## 588 9.932753e-01 6.724738e-03
## 594 9.292477e-01 7.075226e-02
## 596 8.211967e-04 9.991788e-01
## 599 5.963614e-01 4.036386e-01
## 605 2.075367e-02 9.792463e-01
## 610 9.990498e-01 9.502455e-04
## 616 9.883942e-01 1.160581e-02
## 620 1.997877e-04 9.998002e-01
## 622 7.732523e-03 9.922675e-01
## 625 6.145925e-01 3.854075e-01
## 626 9.799370e-01 2.006303e-02
## 629 7.364559e-01 2.635441e-01
## 632 2.302037e-03 9.976980e-01
## 634 9.939419e-01 6.058087e-03
## 641 3.609356e-02 9.639064e-01
## 642 9.704052e-04 9.990296e-01
## 644 9.292809e-01 7.071914e-02
## 647 9.970956e-01 2.904402e-03
## 653 1.366774e-02 9.863323e-01
## 657 8.949552e-01 1.050448e-01
## 664 9.973047e-01 2.695273e-03
## 682 9.988072e-01 1.192777e-03
## 684 1.462187e-03 9.985378e-01
## 690 9.975925e-02 9.002407e-01
## 696 6.215029e-01 3.784971e-01
## 713 5.806549e-02 9.419345e-01
## 718 9.712271e-01 2.877291e-02
## 725 4.502434e-03 9.954976e-01
## 732 1.399487e-01 8.600513e-01
## 735 9.989741e-01 1.025876e-03
## 743 2.381036e-01 7.618964e-01
## 754 5.326188e-04 9.994674e-01
## 767 2.114580e-01 7.885420e-01
## 768 7.976654e-01 2.023346e-01
## 773 9.895303e-01 1.046968e-02
## 775 1.249986e-03 9.987500e-01
## 781 4.248697e-01 5.751303e-01
## 805 9.990533e-01 9.466539e-04
## 806 3.268942e-02 9.673106e-01
## 809 9.955434e-01 4.456597e-03
## 823 2.625268e-02 9.737473e-01
## 835 9.894502e-01 1.054979e-02
## 863 9.413821e-01 5.861788e-02
## 864 9.929764e-01 7.023600e-03
## 865 7.561672e-03 9.924383e-01
## 867 2.008621e-03 9.979914e-01
## 870 6.243752e-03 9.937562e-01
## 874 8.138007e-01 1.861993e-01
## 883 4.726083e-01 5.273917e-01
## 891 9.978743e-01 2.125712e-03
## 900 9.624537e-01 3.754631e-02
## 906 9.974376e-01 2.562356e-03
## 912 8.774902e-01 1.225098e-01
## 919 6.379959e-03 9.936200e-01
## 926 9.509625e-01 4.903745e-02
## 930 6.753586e-03 9.932464e-01
## 933 1.701486e-03 9.982985e-01
## 936 9.999687e-01 3.129702e-05
## 939 2.580893e-03 9.974191e-01
## 954 6.856779e-01 3.143221e-01
## 966 6.731369e-02 9.326863e-01
## 995 2.611728e-03 9.973883e-01
## 997 9.976103e-01 2.389705e-03
## 1003 2.516099e-03 9.974839e-01
## 1013 9.753896e-01 2.461043e-02
## 1015 9.883617e-01 1.163828e-02
## 1017 1.763431e-02 9.823657e-01
## 1018 3.505381e-01 6.494619e-01
## 1022 6.597022e-02 9.340298e-01
## 1024 1.340293e-02 9.865971e-01
## 1028 2.366180e-03 9.976338e-01
## 1044 8.904463e-03 9.910955e-01
## 1048 9.426095e-01 5.739050e-02
## 1053 5.834704e-01 4.165296e-01
## 1059 1.213542e-03 9.987865e-01
## 1076 9.896893e-01 1.031067e-02
## 1078 9.857162e-01 1.428383e-02
## 1080 9.599944e-01 4.000559e-02
## 1083 1.847632e-02 9.815237e-01
## 1086 9.222724e-01 7.772762e-02
## 1096 9.911931e-02 9.008807e-01
## 1099 9.991120e-01 8.880372e-04
## 1134 1.431311e-03 9.985687e-01
## 1153 4.897458e-01 5.102542e-01
## 1162 1.019704e-01 8.980296e-01
## 1163 9.992909e-01 7.090629e-04
## 1182 9.483142e-02 9.051686e-01
## 1186 9.977900e-01 2.210048e-03
## 1217 5.373391e-01 4.626609e-01
## 1223 7.298534e-01 2.701466e-01
## 1225 1.002582e-02 9.899742e-01
## 1232 9.898524e-01 1.014756e-02
## 1243 9.698644e-01 3.013557e-02
## 1250 9.943082e-01 5.691763e-03
## 1256 9.945413e-01 5.458703e-03
## 1272 5.983014e-01 4.016986e-01
## 1275 9.994212e-01 5.787550e-04
## 1277 8.183364e-01 1.816636e-01
## 1279 7.304271e-04 9.992696e-01
## 1287 9.311929e-01 6.880711e-02
## 1289 2.121189e-03 9.978788e-01
## 1301 2.568067e-03 9.974319e-01
## 1330 1.157762e-01 8.842238e-01
## 1339 8.142460e-03 9.918575e-01
## 1349 1.147212e-02 9.885279e-01
## 1361 3.442294e-01 6.557706e-01
## 1367 8.917968e-03 9.910820e-01
## 1370 6.846544e-04 9.993153e-01
## 1376 6.510773e-03 9.934892e-01
## 1378 9.724688e-01 2.753122e-02
## 1383 9.465195e-01 5.348054e-02
## 1391 1.461918e-02 9.853808e-01
## 1399 9.004292e-01 9.957081e-02
## 1413 9.993880e-01 6.119850e-04
## 1419 9.972793e-01 2.720656e-03
## 1422 9.719974e-01 2.800264e-02
## 1427 2.170844e-03 9.978292e-01
## 1428 9.521519e-01 4.784806e-02
##
## $x
## LD1
## 14 1.18531630
## 29 -0.15748787
## 41 -1.15366004
## 45 -1.54465436
## 53 -2.60225814
## 55 -1.44336574
## 58 1.46817570
## 72 -1.90416691
## 74 -1.70790844
## 77 -2.15536337
## 78 -1.90247700
## 84 -2.39243613
## 85 0.15343576
## 98 -2.21874838
## 102 0.84429429
## 103 -1.45966485
## 104 0.07402211
## 108 -2.03850899
## 111 -0.55095592
## 131 0.89138452
## 134 1.28336926
## 142 1.94113855
## 143 -2.02783485
## 157 -1.68277537
## 163 1.27066883
## 168 2.68188717
## 187 -0.28098032
## 189 -1.77999093
## 193 1.29157859
## 200 2.19846687
## 205 -1.79393045
## 213 1.60803714
## 214 -0.63346796
## 216 -1.38980796
## 220 0.84171909
## 221 1.37507472
## 239 1.24514907
## 241 2.51526132
## 260 -2.51387344
## 271 2.69438710
## 274 -0.36406828
## 275 -1.52076182
## 276 -0.06701815
## 289 -2.05104025
## 290 -1.71117179
## 295 -0.63877919
## 299 0.52475080
## 304 -1.56884833
## 317 2.40097541
## 337 2.94835984
## 339 1.47937835
## 353 -1.98694195
## 356 -0.12129403
## 362 -1.65410054
## 373 -1.52600526
## 375 2.02447781
## 386 1.35038768
## 397 -1.76878945
## 410 2.95427434
## 415 2.30646138
## 424 3.41324387
## 425 -1.35033826
## 436 2.01171861
## 445 1.65658491
## 446 -0.81129712
## 450 -2.65212735
## 467 -0.75239213
## 470 0.96748057
## 471 1.05927403
## 472 0.70472839
## 476 -1.90539009
## 479 2.27191430
## 486 -0.94774108
## 516 2.59710888
## 523 -0.82036569
## 526 1.62655729
## 533 -1.94026219
## 542 2.91582181
## 544 -1.22488839
## 549 -2.24103113
## 551 -1.56794852
## 561 -1.02278412
## 562 -0.93312661
## 573 0.80543961
## 577 -0.79246909
## 587 -2.44505708
## 588 -1.71697961
## 594 -0.88515719
## 596 2.44179622
## 599 -0.13416503
## 605 1.32473630
## 610 -2.39158255
## 616 -1.52771117
## 620 2.92786756
## 622 1.66863218
## 625 -0.16040211
## 626 -1.33661179
## 629 -0.35322154
## 632 2.08697934
## 634 -1.75309476
## 641 1.12909492
## 642 2.38435967
## 644 -0.88533039
## 647 -2.00687865
## 653 1.47078323
## 657 -0.73639154
## 664 -2.03263665
## 682 -2.31336373
## 684 2.24327191
## 690 0.75616119
## 696 -0.17046417
## 713 0.95774272
## 718 -1.20961117
## 725 1.85564158
## 732 0.62410689
## 735 -2.36523343
## 743 0.39978796
## 754 2.59071348
## 767 0.45239671
## 768 -0.47151027
## 773 -1.56351739
## 775 2.29724127
## 781 0.10408476
## 805 -2.39288540
## 806 1.16435710
## 809 -1.85917459
## 823 1.24200987
## 835 -1.56086975
## 863 -0.95428661
## 864 -1.70193001
## 865 1.67637117
## 867 2.13394595
## 870 1.74265443
## 874 -0.50695899
## 883 0.03769852
## 891 -2.11443063
## 900 -1.11501290
## 906 -2.05006546
## 912 -0.67675110
## 919 1.73518957
## 926 -1.01910656
## 930 1.71549826
## 933 2.19109159
## 936 -3.56509947
## 939 2.04758136
## 954 -0.26810190
## 966 0.90355143
## 995 2.04348841
## 997 -2.07410216
## 1003 2.05634307
## 1013 -1.26479294
## 1015 -1.52673963
## 1017 1.38181435
## 1018 0.21196643
## 1022 0.91097571
## 1024 1.47760055
## 1028 2.07751082
## 1044 1.61972044
## 1048 -0.96200804
## 1053 -0.11584779
## 1059 2.30742434
## 1076 -1.56883319
## 1078 -1.45541371
## 1080 -1.09232612
## 1083 1.36548704
## 1086 -0.85024814
## 1096 0.75861751
## 1099 -2.41487642
## 1134 2.25061838
## 1153 0.01410041
## 1162 0.74778051
## 1163 -2.49229995
## 1182 0.77545035
## 1186 -2.10102814
## 1217 -0.05143321
## 1223 -0.34162089
## 1225 1.57856179
## 1232 -1.57437095
## 1243 -1.19322376
## 1250 -1.77466090
## 1256 -1.78911218
## 1272 -0.13693750
## 1275 -2.56214323
## 1277 -0.51734606
## 1279 2.48208906
## 1287 -0.89545809
## 1289 2.11516427
## 1301 2.04929820
## 1330 0.69881031
## 1339 1.65073429
## 1349 1.53174019
## 1361 0.22153162
## 1367 1.61919482
## 1370 2.50434905
## 1376 1.72816793
## 1378 -1.22521328
## 1383 -0.98768440
## 1391 1.44732034
## 1399 -0.75688304
## 1413 -2.54294219
## 1419 -2.02940598
## 1422 -1.21921081
## 1427 2.10719364
## 1428 -1.02797597
The confusionmatrix function of the biotools package computes the confusion matrix of the classification model on the test set.
pred <- predict(object = modelo_lda, newdata = datos_test)
confusionmatrix(datos_test$PriceCategory, pred$class)
## new Cheap new Expensive
## Cheap 100 2
## Expensive 10 92
# Classification error percentage
training_error <- mean(datos_test$PriceCategory != pred$class) * 100
paste("training_error=", training_error, "%")
## [1] "training_error= 5.88235294117647 %"
The correct classification rate is therefore 94.12%.
From a geometric point of view, linear discriminant analysis separates the feature space with a hyperplane (a straight line in two dimensions).
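To illustrate this geometry, the following toy sketch (with made-up weights, unrelated to the fitted Ames model) shows how a two-class linear discriminant assigns a point according to the side of the hyperplane w'x + b = 0 on which it falls:

```r
# Toy illustration of a linear decision boundary in two dimensions.
# The weights w and intercept b are hypothetical, chosen for the sketch.
w <- c(1, -1)   # discriminant coefficients (made up)
b <- 0          # intercept (made up)
classify <- function(x) if (sum(w * x) + b > 0) "Expensive" else "Cheap"
classify(c(2, 1))  # above the line x1 = x2 -> "Expensive"
classify(c(1, 2))  # below it            -> "Cheap"
```

The fitted LDA model works the same way, with the coefficients of linear discriminants shown above playing the role of w.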
pred <- predict(object = modelo_lda, newdata = datos_test)
# Compute the ROC curve
roc_curve <- roc(datos_test$PriceCategory, as.numeric(pred$posterior[,2]), levels = c("Cheap", "Expensive"))
# Plot ROC curve
plot(roc_curve, main="ROC curve for LDA", col="#1c61b6")
# Add diagonal line
abline(0, 1, lty=2, col = "red")
auc(roc_curve)
## Area under the curve: 0.9865
# Compute confusion Matrix
conf_mat <- confusionMatrix(pred$class, datos_test$PriceCategory)
# Extract values TP, FP, TN, FN
TP <- conf_mat$table[2,2]
FP <- conf_mat$table[1,2]
TN <- conf_mat$table[1,1]
FN <- conf_mat$table[2,1]
# Compute metrics
PPV <- TP / (TP + FP) # Positive Predictive Value
TPR <- TP / (TP + FN) # True Positive Rate or Sensitivity
TNR <- TN / (TN + FP) # True Negative Rate or Specificity
F1 <- 2 * PPV * TPR / (PPV + TPR) # F1 Score
G_mean <- sqrt(TPR * TNR) # G-mean
Accuracy <- (TP + TN) / (TP + FP + FN + TN) # Accuracy
# Compute AUC
roc_curve <- roc(datos_test$PriceCategory, as.numeric(pred$posterior[,2]), levels = c("Cheap", "Expensive"))
AUC <- auc(roc_curve)
# Print metrics
cat("PPV:", PPV, "\n",
"TPR:", TPR, "\n",
"TNR:", TNR, "\n",
"F1-score:", F1, "\n",
"G-mean:", G_mean, "\n",
"Accuracy:", Accuracy, "\n",
"AUC:", AUC, "\n")
## PPV: 0.9019608
## TPR: 0.9787234
## TNR: 0.9090909
## F1-score: 0.9387755
## G-mean: 0.9432648
## Accuracy: 0.9411765
## AUC: 0.9865436
Although the assumption of multivariate normality is not verified, and given that the group variances are not homogeneous, we fit a quadratic discriminant model: it does not require equal covariance matrices across groups, although the lack of normality means the possibility of unexpected results must be kept in mind.
The qda function of the MASS package fits this classifier.
modelo_qda <- qda(PriceCategory ~ LotFrontage + LotArea + OverallQual + OverallCond + YearBuilt + GrLivArea + X1stFlrSF + X2ndFlrSF + YrSold ,data = datos_train)
modelo_qda
## Call:
## qda(PriceCategory ~ LotFrontage + LotArea + OverallQual + OverallCond +
## YearBuilt + GrLivArea + X1stFlrSF + X2ndFlrSF + YrSold, data = datos_train)
##
## Prior probabilities of groups:
## Cheap Expensive
## 0.5 0.5
##
## Group means:
## LotFrontage LotArea OverallQual OverallCond YearBuilt GrLivArea
## Cheap 67.79471 8880.846 5.276961 5.664216 1959.056 1234.218
## Expensive 73.35384 10205.684 7.132353 5.235294 1995.265 1758.924
## X1stFlrSF X2ndFlrSF YrSold
## Cheap 1054.777 172.2819 2007.782
## Expensive 1298.103 459.8799 2007.811
We did not add the categorical variables to this model because doing so raised an error (rank deficiency in group Cheap).
The output of this object shows the prior probabilities of each group (0.5 in both cases) and the mean of each regressor per group.
Once the classifier is built, we can classify new data from its measurements simply by calling the predict function. We classify all the observations in the test set.
nuevas_observaciones <- datos_test
predict(object = modelo_qda, newdata = nuevas_observaciones)
## $class
## [1] Expensive Expensive Cheap Cheap Cheap Cheap Expensive
## [8] Cheap Cheap Cheap Cheap Cheap Expensive Cheap
## [15] Expensive Cheap Expensive Cheap Cheap Expensive Expensive
## [22] Expensive Cheap Cheap Expensive Expensive Expensive Cheap
## [29] Expensive Expensive Cheap Expensive Cheap Cheap Expensive
## [36] Expensive Expensive Expensive Cheap Expensive Expensive Cheap
## [43] Cheap Cheap Cheap Cheap Expensive Cheap Expensive
## [50] Expensive Expensive Cheap Expensive Cheap Cheap Expensive
## [57] Expensive Cheap Expensive Expensive Expensive Cheap Expensive
## [64] Expensive Cheap Cheap Cheap Expensive Expensive Expensive
## [71] Cheap Expensive Cheap Expensive Cheap Expensive Cheap
## [78] Expensive Expensive Cheap Cheap Cheap Cheap Expensive
## [85] Cheap Cheap Cheap Cheap Expensive Expensive Expensive
## [92] Cheap Cheap Expensive Expensive Expensive Cheap Expensive
## [99] Expensive Cheap Expensive Expensive Expensive Cheap Expensive
## [106] Cheap Cheap Cheap Expensive Expensive Expensive Expensive
## [113] Cheap Expensive Expensive Cheap Expensive Expensive Expensive
## [120] Cheap Cheap Expensive Expensive Cheap Expensive Cheap
## [127] Expensive Cheap Cheap Cheap Expensive Expensive Expensive
## [134] Cheap Expensive Cheap Cheap Cheap Cheap Expensive
## [141] Cheap Expensive Expensive Cheap Expensive Cheap Expensive
## [148] Expensive Cheap Expensive Cheap Cheap Expensive Expensive
## [155] Expensive Expensive Expensive Expensive Cheap Expensive Expensive
## [162] Cheap Cheap Cheap Expensive Cheap Expensive Cheap
## [169] Expensive Expensive Expensive Cheap Expensive Cheap Expensive
## [176] Cheap Expensive Cheap Expensive Cheap Cheap Cheap
## [183] Cheap Expensive Expensive Cheap Expensive Expensive Expensive
## [190] Expensive Expensive Cheap Expensive Expensive Expensive Cheap
## [197] Cheap Expensive Cheap Cheap Cheap Cheap Expensive
## [204] Cheap
## Levels: Cheap Expensive
##
## $posterior
## Cheap Expensive
## 14 1.598651e-02 9.840135e-01
## 29 6.185334e-02 9.381467e-01
## 41 9.618183e-01 3.818166e-02
## 45 9.597117e-01 4.028826e-02
## 53 9.996636e-01 3.363617e-04
## 55 9.969865e-01 3.013546e-03
## 58 1.072621e-03 9.989274e-01
## 72 9.755882e-01 2.441185e-02
## 74 9.926985e-01 7.301516e-03
## 77 9.961080e-01 3.892016e-03
## 78 9.996536e-01 3.463523e-04
## 84 9.991322e-01 8.678486e-04
## 85 1.758521e-02 9.824148e-01
## 98 9.993453e-01 6.547191e-04
## 102 4.098831e-02 9.590117e-01
## 103 6.448357e-01 3.551643e-01
## 104 5.150332e-02 9.484967e-01
## 108 9.999838e-01 1.619422e-05
## 111 9.999890e-01 1.102633e-05
## 131 1.214795e-03 9.987852e-01
## 134 4.243563e-03 9.957564e-01
## 142 4.547363e-03 9.954526e-01
## 143 9.999998e-01 1.624727e-07
## 157 9.935918e-01 6.408201e-03
## 163 8.492884e-03 9.915071e-01
## 168 1.437689e-04 9.998562e-01
## 187 1.009656e-01 8.990344e-01
## 189 8.583566e-01 1.416434e-01
## 193 1.092336e-02 9.890766e-01
## 200 2.016288e-04 9.997984e-01
## 205 9.928382e-01 7.161788e-03
## 213 3.625303e-04 9.996375e-01
## 214 5.554602e-01 4.445398e-01
## 216 9.353098e-01 6.469019e-02
## 220 3.035077e-03 9.969649e-01
## 221 1.228220e-02 9.877178e-01
## 239 3.077326e-04 9.996923e-01
## 241 3.595329e-04 9.996405e-01
## 260 9.997670e-01 2.330396e-04
## 271 1.981789e-04 9.998018e-01
## 274 4.393129e-01 5.606871e-01
## 275 9.993106e-01 6.893748e-04
## 276 9.999430e-01 5.697134e-05
## 289 9.982319e-01 1.768139e-03
## 290 9.999873e-01 1.271531e-05
## 295 9.836750e-01 1.632497e-02
## 299 5.945010e-02 9.405499e-01
## 304 9.966317e-01 3.368294e-03
## 317 2.699865e-03 9.973001e-01
## 337 1.983135e-06 9.999980e-01
## 339 3.219318e-01 6.780682e-01
## 353 9.983393e-01 1.660654e-03
## 356 4.951785e-01 5.048215e-01
## 362 9.999694e-01 3.061819e-05
## 373 8.305816e-01 1.694184e-01
## 375 1.730682e-04 9.998269e-01
## 386 2.919351e-04 9.997081e-01
## 397 9.899145e-01 1.008555e-02
## 410 2.953204e-05 9.999705e-01
## 415 2.766892e-04 9.997233e-01
## 424 2.508788e-05 9.999749e-01
## 425 9.976045e-01 2.395498e-03
## 436 1.528628e-03 9.984714e-01
## 445 9.873844e-04 9.990126e-01
## 446 9.496542e-01 5.034577e-02
## 450 8.565866e-01 1.434134e-01
## 467 7.168933e-01 2.831067e-01
## 470 6.254450e-03 9.937455e-01
## 471 5.809803e-03 9.941902e-01
## 472 4.004270e-03 9.959957e-01
## 476 9.849711e-01 1.502888e-02
## 479 1.840944e-04 9.998159e-01
## 486 9.928581e-01 7.141913e-03
## 516 3.390832e-08 1.000000e+00
## 523 9.765542e-01 2.344580e-02
## 526 2.717020e-02 9.728298e-01
## 533 9.980792e-01 1.920827e-03
## 542 6.957149e-06 9.999930e-01
## 544 7.135313e-02 9.286469e-01
## 549 9.931445e-01 6.855472e-03
## 551 5.829181e-01 4.170819e-01
## 561 8.749257e-01 1.250743e-01
## 562 9.246905e-01 7.530952e-02
## 573 5.329391e-04 9.994671e-01
## 577 9.999536e-01 4.635774e-05
## 587 9.999883e-01 1.168584e-05
## 588 9.642178e-01 3.578216e-02
## 594 5.383447e-01 4.616553e-01
## 596 1.227445e-04 9.998773e-01
## 599 3.660727e-01 6.339273e-01
## 605 1.323418e-02 9.867658e-01
## 610 9.990382e-01 9.617520e-04
## 616 9.916315e-01 8.368457e-03
## 620 5.550637e-05 9.999445e-01
## 622 3.740128e-02 9.625987e-01
## 625 1.956842e-01 8.043158e-01
## 626 9.270346e-01 7.296541e-02
## 629 5.691767e-02 9.430823e-01
## 632 3.080660e-05 9.999692e-01
## 634 9.935470e-01 6.452960e-03
## 641 2.651247e-04 9.997349e-01
## 642 1.028925e-03 9.989711e-01
## 644 3.890896e-01 6.109104e-01
## 647 9.998947e-01 1.053373e-04
## 653 3.085258e-03 9.969147e-01
## 657 9.883382e-01 1.166176e-02
## 664 9.959742e-01 4.025809e-03
## 682 9.999993e-01 6.734452e-07
## 684 3.032820e-05 9.999697e-01
## 690 6.963912e-02 9.303609e-01
## 696 2.122048e-01 7.877952e-01
## 713 5.057171e-03 9.949428e-01
## 718 8.727371e-01 1.272629e-01
## 725 7.302186e-06 9.999927e-01
## 732 1.029524e-01 8.970476e-01
## 735 9.972914e-01 2.708645e-03
## 743 2.001277e-02 9.799872e-01
## 754 4.674900e-05 9.999533e-01
## 767 9.648524e-03 9.903515e-01
## 768 7.017834e-01 2.982166e-01
## 773 9.218587e-01 7.814129e-02
## 775 2.648684e-05 9.999735e-01
## 781 4.746499e-02 9.525350e-01
## 805 9.996526e-01 3.473942e-04
## 806 2.121184e-02 9.787882e-01
## 809 9.956607e-01 4.339348e-03
## 823 6.637024e-04 9.993363e-01
## 835 9.603370e-01 3.966300e-02
## 863 7.787805e-01 2.212195e-01
## 864 9.985762e-01 1.423775e-03
## 865 1.590206e-02 9.840979e-01
## 867 2.695107e-04 9.997305e-01
## 870 1.321993e-03 9.986780e-01
## 874 1.000000e+00 6.070884e-30
## 883 1.642622e-02 9.835738e-01
## 891 9.978902e-01 2.109818e-03
## 900 9.700519e-01 2.994807e-02
## 906 9.998208e-01 1.791535e-04
## 912 9.812232e-01 1.877677e-02
## 919 1.447299e-05 9.999855e-01
## 926 7.093340e-01 2.906660e-01
## 930 1.953564e-05 9.999805e-01
## 933 4.071608e-06 9.999959e-01
## 936 1.000000e+00 3.115341e-09
## 939 2.439865e-03 9.975601e-01
## 954 9.979050e-01 2.094981e-03
## 966 1.302896e-02 9.869710e-01
## 995 3.864569e-07 9.999996e-01
## 997 9.650521e-01 3.494788e-02
## 1003 4.110950e-04 9.995889e-01
## 1013 9.986226e-01 1.377412e-03
## 1015 9.996176e-01 3.824138e-04
## 1017 9.398848e-03 9.906012e-01
## 1018 2.671632e-02 9.732837e-01
## 1022 4.024673e-02 9.597533e-01
## 1024 4.414345e-04 9.995586e-01
## 1028 2.771546e-04 9.997228e-01
## 1044 9.842674e-04 9.990157e-01
## 1048 7.704019e-01 2.295981e-01
## 1053 2.761629e-02 9.723837e-01
## 1059 2.260297e-06 9.999977e-01
## 1076 8.321756e-01 1.678244e-01
## 1078 9.825305e-01 1.746945e-02
## 1080 8.680819e-01 1.319181e-01
## 1083 1.212824e-02 9.878718e-01
## 1086 7.806956e-01 2.193044e-01
## 1096 2.359656e-01 7.640344e-01
## 1099 9.980734e-01 1.926609e-03
## 1134 3.040128e-04 9.996960e-01
## 1153 1.003761e-01 8.996239e-01
## 1162 1.532450e-01 8.467550e-01
## 1163 9.981806e-01 1.819409e-03
## 1182 1.694997e-04 9.998305e-01
## 1186 9.994157e-01 5.843338e-04
## 1217 1.647274e-01 8.352726e-01
## 1223 6.578686e-01 3.421314e-01
## 1225 8.941752e-04 9.991058e-01
## 1232 9.277972e-01 7.220283e-02
## 1243 3.966838e-01 6.033162e-01
## 1250 9.956461e-01 4.353921e-03
## 1256 9.999212e-01 7.884575e-05
## 1272 8.214574e-01 1.785426e-01
## 1275 1.000000e+00 1.620504e-08
## 1277 1.115804e-01 8.884196e-01
## 1279 1.520985e-04 9.998479e-01
## 1287 9.640120e-01 3.598799e-02
## 1289 3.582491e-05 9.999642e-01
## 1301 3.078849e-04 9.996921e-01
## 1330 7.085867e-03 9.929141e-01
## 1339 2.295897e-04 9.997704e-01
## 1349 5.796376e-03 9.942036e-01
## 1361 8.950806e-01 1.049194e-01
## 1367 1.000122e-03 9.989999e-01
## 1370 3.290683e-05 9.999671e-01
## 1376 7.784750e-04 9.992215e-01
## 1378 9.998202e-01 1.797535e-04
## 1383 9.999636e-01 3.643660e-05
## 1391 1.088589e-02 9.891141e-01
## 1399 9.999990e-01 9.822073e-07
## 1413 9.999575e-01 4.246619e-05
## 1419 9.962932e-01 3.706818e-03
## 1422 6.991269e-01 3.008731e-01
## 1427 6.509543e-04 9.993490e-01
## 1428 9.307370e-01 6.926301e-02
The confusionmatrix function of the biotools package computes the confusion matrix of the classification model on the test set.
pred <- predict(object = modelo_qda, newdata = datos_test)
confusionmatrix(datos_test$PriceCategory, pred$class)
## new Cheap new Expensive
## Cheap 91 11
## Expensive 6 96
# Classification error percentage
training_error <- mean(datos_test$PriceCategory != pred$class) * 100
paste("training_error=", training_error, "%")
## [1] "training_error= 8.33333333333333 %"
In this case the correct classification rate is 91.67%.
pred <- predict(object = modelo_qda, newdata = datos_test)
# Compute ROC curve
roc_curve <- roc(datos_test$PriceCategory, as.numeric(pred$posterior[,2]), levels = c("Cheap", "Expensive"))
# Plot ROC curve
plot(roc_curve, main="ROC curve for QDA", col="#1c61b6")
# Add diagonal line
abline(0, 1, lty=2, col = "red")
auc(roc_curve)
## Area under the curve: 0.9648
# Compute confusion Matrix
conf_mat <- confusionMatrix(pred$class, datos_test$PriceCategory)
# Extract values TP, FP, TN, FN
TP <- conf_mat$table[2,2]
FP <- conf_mat$table[1,2]
TN <- conf_mat$table[1,1]
FN <- conf_mat$table[2,1]
# Compute metrics
PPV <- TP / (TP + FP) # Positive Predictive Value
TPR <- TP / (TP + FN) # True Positive Rate or Sensitivity
TNR <- TN / (TN + FP) # True Negative Rate or Specificity
F1 <- 2 * PPV * TPR / (PPV + TPR) # F1 Score
G_mean <- sqrt(TPR * TNR) # G-mean
Accuracy <- (TP + TN) / (TP + FP + FN + TN) # Accuracy
# Compute AUC
roc_curve <- roc(datos_test$PriceCategory, as.numeric(pred$posterior[,2]), levels = c("Cheap", "Expensive"))
AUC <- auc(roc_curve)
# Print metrics
cat("PPV:", PPV, "\n",
"TPR:", TPR, "\n",
"TNR:", TNR, "\n",
"F1-score:", F1, "\n",
"G-mean:", G_mean, "\n",
"Accuracy:", Accuracy, "\n",
"AUC:", AUC, "\n")
## PPV: 0.9411765
## TPR: 0.8971963
## TNR: 0.9381443
## F1-score: 0.9186603
## G-mean: 0.9174419
## Accuracy: 0.9166667
## AUC: 0.9648212
# Distance matrix between observations (get_dist from the factoextra package)
distance <- get_dist(numeric_data)
# Heat map of the distance matrix
fviz_dist(distance, gradient = list(low = "yellow", mid = "white", high = "brown"))
Hierarchical clustering seeks a hierarchy of groups based on the closeness or similarity of the observations under the chosen distance. In the agglomerative case, we start from singleton groups containing the individual observations; the closest pairs are then merged successively, building groups in an ascending manner. This construction can be visualized with a dendrogram.
The groups are defined by the vertical lines of the dendrogram, and the optimal number of groups can be estimated from this same graph.
dendrogram <- hclust(dist(numeric_data, method = 'euclidean'), method = 'ward.D')
ggdendrogram(dendrogram, rotate = FALSE, labels = FALSE, theme_dendro = TRUE) +
labs(title = "Dendrogram")
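Once the dendrogram is built, the tree can be cut at a chosen level to recover explicit cluster labels. A self-contained sketch on toy one-dimensional data (not the Ames variables) using base R's cutree:

```r
# Toy data: two well-separated groups of three points each
x <- data.frame(v = c(1, 2, 3, 10, 11, 12))
hc <- hclust(dist(x, method = "euclidean"), method = "ward.D")
groups <- cutree(hc, k = 2)  # cut the tree into 2 clusters
groups
# 1 1 1 2 2 2  (first three points in one cluster, last three in the other)
```

The same call applied to the dendrogram above would yield cluster labels for the Ames observations.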
k2 <- kmeans(numeric_data, centers = 2, nstart = 25)
# Displaying all the fields of the object k2
str(k2)
## List of 9
## $ cluster : Named int [1:1020] 2 1 2 1 1 1 1 1 1 1 ...
## ..- attr(*, "names")= chr [1:1020] "1" "3" "4" "5" ...
## $ centers : num [1:2, 1:10] 43.2 54.4 77.6 65.3 11899.6 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "1" "2"
## .. ..$ : chr [1:10] "MSSubClass" "LotFrontage" "LotArea" "OverallQual" ...
## $ totss : num 7.83e+09
## $ withinss : num [1:2] 1.5e+09 1.8e+09
## $ tot.withinss: num 3.3e+09
## $ betweenss : num 4.53e+09
## $ size : int [1:2] 454 566
## $ iter : int 1
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
The output of the kmeans function is a list whose main components are cluster (the cluster assigned to each observation), centers (the matrix of cluster centers), totss, withinss, tot.withinss and betweenss (the sums of squares), and size (the number of observations in each cluster).
Displaying the variable k2 shows the resulting grouping into 2 clusters.
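The sums of squares reported by kmeans satisfy the decomposition totss = tot.withinss + betweenss, which is what makes betweenss/totss usable as a measure of separation. A toy sketch on made-up data (not the Ames variables):

```r
# Toy data: two obvious one-dimensional clusters
set.seed(1)
toy <- matrix(c(1, 2, 3, 10, 11, 12), ncol = 1)
k <- kmeans(toy, centers = 2, nstart = 10)
# Total sum of squares decomposes into within- plus between-cluster parts
all.equal(k$totss, k$tot.withinss + k$betweenss)  # TRUE
```

For the k2 object above, the same identity relates the totss, tot.withinss and betweenss fields shown by str(k2).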
k2
## K-means clustering with 2 clusters of sizes 454, 566
##
## Cluster means:
## MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt GrLivArea
## 1 43.24890 77.63516 11899.641 6.550661 5.407489 1981.295 1671.586
## 2 54.37279 65.31116 7678.396 5.948763 5.489399 1973.949 1345.067
## X1stFlrSF X2ndFlrSF YrSold
## 1 1278.601 389.3062 2007.744
## 2 1093.965 247.8039 2007.850
##
## Clustering vector:
## 1 3 4 5 6 7 8 11 12 13 14 15 17 18 19 20
## 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2
## 21 22 23 24 26 27 28 29 31 32 33 34 35 36 38 39
## 1 2 2 2 1 2 1 1 2 2 1 1 2 1 2 2
## 41 43 44 45 46 47 48 50 51 52 53 55 56 58 60 61
## 2 2 2 2 2 1 1 2 1 2 2 2 1 1 2 1
## 62 63 64 65 66 68 69 70 72 73 74 75 77 78 80 81
## 2 2 1 2 2 1 2 1 2 1 1 2 2 2 1 1
## 82 83 84 85 91 93 95 97 98 101 102 103 104 105 106 108
## 2 1 2 2 2 1 2 1 1 1 2 2 1 2 2 2
## 110 111 112 113 117 118 120 122 123 124 127 129 130 131 132 133
## 1 2 2 1 1 2 2 2 2 2 2 2 2 1 1 2
## 134 135 136 137 138 139 140 142 143 144 145 148 152 153 154 157
## 2 1 1 1 1 2 1 1 2 1 2 2 1 1 1 2
## 158 159 161 162 163 165 167 168 169 170 171 174 175 177 178 183
## 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 2
## 184 187 189 190 193 195 197 200 201 202 203 204 205 206 207 208
## 1 1 2 2 2 2 2 2 2 1 2 2 2 1 1 1
## 209 210 212 213 214 215 216 217 218 220 221 222 223 224 227 229
## 1 2 1 2 1 1 1 2 1 2 2 2 1 1 1 2
## 230 231 234 235 237 238 239 240 241 245 246 248 249 252 253 254
## 2 2 1 2 2 2 1 2 2 2 1 1 1 2 2 2
## 255 256 257 258 259 260 262 263 264 265 266 267 270 271 273 274
## 2 2 2 2 1 1 2 2 2 2 1 1 2 1 1 2
## 275 276 279 280 281 282 283 284 285 287 289 290 293 294 295 296
## 2 2 1 1 1 2 2 2 2 1 2 2 1 1 2 2
## 297 298 299 302 304 306 309 310 311 312 316 317 318 319 320 322
## 1 2 1 1 2 1 1 1 2 2 2 1 2 1 1 1
## 326 327 328 329 331 332 333 334 335 337 338 339 340 341 342 343
## 2 1 1 1 1 2 1 2 2 1 2 1 1 1 2 2
## 344 346 351 352 353 355 356 357 358 359 360 361 362 363 366 367
## 2 2 2 2 2 2 1 2 2 2 1 2 2 2 1 2
## 368 369 370 371 373 374 375 377 378 380 381 382 383 386 388 389
## 2 2 1 2 2 1 2 2 1 2 2 2 2 2 2 2
## 390 392 393 395 396 397 398 400 401 402 404 405 406 407 408 409
## 1 1 2 1 2 2 2 2 1 2 1 1 1 1 1 1
## 410 413 414 415 416 417 419 420 421 422 424 425 427 428 429 434
## 1 2 2 1 2 2 2 2 2 1 2 2 1 2 2 1
## 436 438 443 444 445 446 448 449 450 453 454 455 456 459 460 461
## 1 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2
## 463 464 466 467 468 469 470 471 472 474 475 476 477 478 479 480
## 2 1 2 1 2 1 2 2 1 1 2 2 1 1 1 2
## 481 482 484 485 486 487 488 492 493 494 498 499 500 502 506 507
## 1 1 2 2 2 1 1 2 1 2 2 2 2 1 2 2
## 508 510 511 512 513 514 516 517 518 519 522 523 525 526 527 531
## 2 2 1 2 2 2 1 1 1 2 1 2 1 2 1 1
## 533 535 537 538 539 540 541 542 543 544 546 547 548 549 550 551
## 2 2 2 1 1 1 1 1 1 2 1 2 2 2 2 2
## 552 553 554 555 556 557 558 560 561 562 565 566 567 568 570 571
## 2 1 2 1 2 1 1 2 1 1 1 2 1 1 2 1
## 572 573 574 575 576 577 578 581 582 585 586 587 588 590 591 592
## 2 1 1 1 2 2 1 1 1 2 1 1 2 2 2 1
## 594 595 596 597 598 599 601 602 603 605 606 607 610 611 612 613
## 2 2 1 2 2 1 1 2 1 1 1 1 2 1 1 1
## 616 617 618 619 620 622 623 625 626 627 628 629 630 632 633 634
## 2 2 2 1 1 1 2 1 1 1 2 1 2 2 1 2
## 635 640 641 642 644 645 646 647 648 649 651 653 654 655 657 658
## 2 2 1 2 1 2 1 2 1 2 2 2 1 1 1 2
## 660 661 664 666 668 669 670 671 672 673 674 675 679 680 681 682
## 1 1 1 1 2 1 1 2 2 1 1 2 1 1 2 2
## 683 684 685 687 689 690 691 694 695 696 697 698 700 701 702 703
## 2 1 1 1 2 2 2 2 2 1 2 2 2 1 2 1
## 705 708 709 710 712 713 716 718 719 720 721 722 723 724 725 726
## 2 2 2 2 2 2 1 1 1 1 2 2 2 2 1 2
## 728 729 730 731 732 733 734 735 736 737 738 740 741 743 744 745
## 2 1 2 2 2 1 1 2 1 2 1 2 2 2 1 2
## 747 749 752 753 754 757 758 760 762 763 764 765 766 767 768 769
## 2 1 2 2 1 1 1 1 2 2 2 2 1 1 1 2
## 771 772 773 774 775 776 777 778 779 780 781 782 783 784 786 787
## 2 2 2 1 1 2 1 1 2 1 2 2 1 2 2 1
## 788 791 792 793 794 795 796 797 798 800 801 802 803 805 806 807
## 1 2 1 1 2 1 2 2 2 2 1 2 2 2 1 2
## 809 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825
## 1 1 2 2 2 2 1 1 1 2 2 2 2 1 1 1
## 828 831 833 834 835 836 837 839 840 846 847 848 850 851 852 853
## 2 1 2 1 2 2 2 2 1 1 2 1 2 2 2 2
## 854 857 858 859 860 863 864 865 866 867 868 869 870 871 872 873
## 1 1 2 1 1 2 2 2 2 1 2 1 1 2 2 2
## 874 875 876 878 879 881 882 883 884 885 886 887 888 891 892 894
## 1 2 2 2 1 2 1 2 2 2 2 2 1 2 1 1
## 895 896 897 898 900 901 902 903 904 905 906 909 911 912 913 914
## 2 2 2 2 2 2 2 2 1 2 1 2 1 2 2 2
## 919 921 923 924 925 926 927 928 929 930 931 932 933 934 936 937
## 1 2 1 2 1 1 1 1 1 1 2 2 1 2 2 1
## 938 939 941 942 945 946 947 948 949 950 952 954 956 958 959 965
## 2 2 1 2 1 2 2 1 1 2 2 1 2 2 2 1
## 966 968 973 974 978 979 980 982 983 984 985 988 989 990 991 994
## 1 2 2 1 2 2 2 1 2 1 1 1 1 2 2 2
## 995 996 997 998 999 1000 1003 1004 1005 1009 1013 1014 1015 1016 1017 1018
## 1 2 1 1 2 2 1 1 2 1 1 2 1 2 1 2
## 1019 1020 1021 1022 1024 1026 1027 1028 1029 1033 1034 1037 1038 1041 1043 1044
## 1 2 2 2 2 2 2 2 2 1 2 1 2 1 2 1
## 1046 1048 1050 1051 1052 1053 1054 1055 1056 1057 1059 1060 1061 1064 1065 1066
## 1 2 1 2 1 2 2 1 1 2 1 1 2 2 1 1
## 1067 1068 1070 1071 1072 1074 1075 1076 1078 1079 1080 1081 1082 1083 1084 1085
## 2 2 2 1 1 2 2 1 1 2 2 1 2 2 2 1
## 1086 1088 1090 1091 1093 1095 1096 1098 1099 1100 1101 1102 1103 1104 1106 1109
## 2 1 2 2 2 2 2 2 2 1 2 2 2 2 1 2
## 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1125 1127 1129
## 1 2 1 2 2 2 1 2 2 1 2 2 1 2 2 1
## 1130 1134 1135 1136 1137 1139 1140 1141 1142 1146 1147 1148 1149 1151 1153 1155
## 2 1 2 2 2 1 2 2 1 2 1 1 2 2 1 1
## 1158 1159 1162 1163 1165 1166 1167 1168 1171 1172 1177 1181 1182 1184 1186 1188
## 2 1 1 2 1 2 1 1 1 2 2 1 2 1 2 1
## 1189 1190 1194 1195 1196 1197 1198 1199 1200 1201 1202 1204 1205 1207 1208 1209
## 2 2 2 2 2 1 2 2 1 2 1 2 1 2 2 2
## 1210 1211 1213 1215 1216 1217 1218 1221 1222 1223 1225 1227 1229 1230 1232 1233
## 1 1 2 1 2 2 2 2 2 1 1 1 2 2 2 1
## 1234 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1250 1251 1252 1253
## 1 1 1 2 2 1 1 1 1 1 2 1 2 1 2 2
## 1255 1256 1257 1259 1262 1263 1264 1265 1268 1270 1272 1273 1275 1276 1277 1279
## 2 2 1 2 2 1 1 2 1 1 2 1 2 1 1 2
## 1280 1281 1283 1285 1286 1287 1289 1290 1291 1293 1294 1295 1296 1297 1300 1301
## 2 1 2 2 2 2 2 1 1 2 1 2 2 2 2 1
## 1302 1303 1304 1306 1307 1308 1309 1310 1312 1314 1315 1316 1317 1318 1319 1320
## 2 1 2 1 2 2 1 2 2 1 2 1 1 2 1 1
## 1322 1323 1325 1330 1331 1332 1334 1336 1337 1339 1341 1342 1343 1345 1346 1348
## 2 1 1 2 1 1 2 2 2 1 2 1 2 1 2 1
## 1349 1351 1352 1355 1356 1357 1358 1361 1363 1364 1366 1367 1369 1370 1371 1372
## 1 1 2 1 1 2 1 1 1 2 2 2 2 1 2 2
## 1373 1375 1376 1378 1380 1382 1383 1385 1388 1389 1390 1391 1392 1393 1395 1396
## 1 1 1 1 2 1 2 2 2 1 2 2 2 2 2 1
## 1399 1401 1402 1403 1404 1405 1406 1407 1409 1411 1413 1414 1415 1416 1418 1419
## 2 2 2 2 1 1 2 2 2 1 2 1 1 2 1 2
## 1420 1421 1422 1423 1425 1426 1427 1428 1429 1430 1432 1434 1437 1438 1439 1440
## 1 1 2 2 2 1 1 1 2 1 2 1 2 1 2 1
## 1441 1442 1443 1445 1446 1448 1452 1455 1456 1457 1459 1460
## 1 2 1 2 2 1 2 2 2 1 2 1
##
## Within cluster sum of squares by cluster:
## [1] 1497003869 1798607593
## (between_SS / total_SS = 57.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# Visualizing the two clusters on the first two principal components
fviz_cluster(k2, data = numeric_data)
# Elbow method: total within-cluster sum of squares for each k
set.seed(123)
fviz_nbclust(numeric_data, kmeans, method = "wss")
# Average silhouette width for each k
set.seed(123)
fviz_nbclust(numeric_data, kmeans, method = "silhouette")
# Gap statistic with 50 bootstrap reference samples
set.seed(123)
gap_stat <- clusGap(numeric_data, FUN = kmeans, nstart = 25, K.max = 10, B = 50)
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
fviz_gap_stat(gap_stat)
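The "did not converge in 10 iterations" warnings come from kmeans' default `iter.max = 10`, which clusGap uses for every bootstrap run. One way to silence them is to pass a wrapper that raises the iteration limit; the sketch below (with a hypothetical wrapper name, demonstrated on built-in data rather than `numeric_data`) shows the idea.

```r
library(cluster)  # clusGap

# kmeans_iter is a hypothetical wrapper that raises iter.max so each
# clusGap bootstrap run has room to converge.
kmeans_iter <- function(x, k) kmeans(x, k, iter.max = 50, nstart = 25)

set.seed(123)
# Demonstrated on built-in data; for the study this would be numeric_data.
gap <- clusGap(scale(iris[, 1:4]), FUN = kmeans_iter, K.max = 5, B = 20)
head(gap$Tab)  # logW, E.logW, gap and SE.sim for each k
```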
# Add cluster's labels to the original dataset
data$Cluster <- k2$cluster
# Create a table for comparing clusters and PriceCategory
table(data$PriceCategory, data$Cluster)
##
## 1 2
## Cheap 160 350
## Expensive 294 216
The clusters coincide with the price categories for the majority of observations: matching each category to its majority cluster (Cheap with cluster 2, Expensive with cluster 1) accounts for 350 + 294 = 644 of the 1020 houses, roughly 63 %.
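This agreement rate can be read directly off the contingency table; a small sketch reproducing the counts above:

```r
# Sketch: overall agreement when each price category is matched to its
# majority cluster (Cheap -> cluster 2, Expensive -> cluster 1),
# using the counts from the table above.
tab <- matrix(c(160, 350,
                294, 216),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Cheap", "Expensive"), c("1", "2")))

agreement <- (tab["Cheap", "2"] + tab["Expensive", "1"]) / sum(tab)
round(agreement, 3)  # 0.631
```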
adjusted_rand_index <- adjustedRandIndex(data$PriceCategory, data$Cluster)
print(adjusted_rand_index)
## [1] 0.06813152
The ARI value is approximately 0.068. The ARI (Adjusted Rand Index) is a measure of similarity between two cluster assignments (here, the PriceCategory variable and the clusters generated by K-means), adjusted for the possibility of agreement by chance. ARI values range from -1 to 1, where:
- ARI = 1 indicates a perfect match between the two groupings.
- ARI = 0 indicates a match no better than random chance.
- ARI < 0 indicates a match worse than random chance.
In this case, a value of about 0.07 suggests that the clusters generated by K-means agree with the PriceCategory variable only slightly better than random chance. This is not a strong match, indicating that the K-means clustering does not align closely with the existing price categories.
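To make the ARI scale concrete, the index can be computed from scratch from a contingency table. The sketch below is a hand-rolled helper illustrating the Hubert–Arabie formula that the mclust function used above implements; it returns 1 for any two labelings that induce the same partition, regardless of how the labels are named.

```r
# Hand-rolled Adjusted Rand Index (Hubert & Arabie) from a contingency table.
ari <- function(x, y) {
  tab <- table(x, y)
  comb2 <- function(n) n * (n - 1) / 2
  sum_ij <- sum(comb2(tab))             # pairs grouped together in both labelings
  sum_a  <- sum(comb2(rowSums(tab)))    # pairs grouped together in x
  sum_b  <- sum(comb2(colSums(tab)))    # pairs grouped together in y
  expected <- sum_a * sum_b / comb2(sum(tab))
  (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)
}

ari(c(1, 1, 2, 2), c(1, 1, 2, 2))  # identical labelings -> 1
ari(c(1, 1, 2, 2), c(2, 2, 1, 1))  # relabeled but same partition -> 1
```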